As the size and complexity of software continue to grow, it will be necessary
for software construction systems to collect, maintain, and utilize much more
information about programs than systems do now. This dissertation explores
compiler utilization of profile data.

Several  widely  held  assumptions  about  collecting  profile  data  are  not  true. 
It  is  not  true  that  the  optimal  instrumentation  problem  has  been  solved,  and  it  is  not 
true  that  counting  traversals  of  the  arcs  of  a  program  flow  graph  is  more  expensive 
and  complex  than  counting  executions  of  basic  blocks.  There  are  simple  program 
flow  graphs  for  which  finding  optimal  instrumentations  is  possibly  exponential.  An 
algorithm  is  presented  that  computes  instrumentations  of  a  program  to  count  arc 
traversals  (and  therefore  basic  block  counts  also).  Such  instrumentations  impose 
10%  to  20%  overhead  on  the  execution  of  a  program,  often  less  than  the  overhead 
required  for  collecting  basic  block  execution  counts. 

An algorithm called Greedy Sewing improves the behavior of programs on
machines with instruction caches. By moving basic blocks physically closer together
if they are executed close together in time, miss rates in instruction caches can be
reduced up to 50%. Arc-count profile data not only allows the compiler to know
which basic blocks to move closer together, it also allows those situations that will
have little or no effect on the final performance of the reorganized program to be
ignored. Such a low-level compiler optimization would be difficult to do without
arc-count profile data.

The primary contribution of this work is the development of TYPESETTER,
a programming system that utilizes profile data to select implementations of
program abstractions. The system integrates the development, evaluation, and
selection of alternative implementations of programming abstractions into a package
that is transparent to the programmer. Unlike previous systems, TYPESETTER does
not require programmers to know details of the compiler implementation.
Experience indicates that the TYPESETTER approach to system synthesis has considerable
benefit, and will continue to be a promising avenue of research.


Chapter 1
Introduction 


The ‘ideal system of the future’ will keep profiles associated with source
programs, using the frequency counts in virtually all phases of a
program’s life. . . . [I]f it is to be a frequently used program the high counts
in its profile often suggest basic improvements that can be made. An
optimizing compiler can also make very effective use of the profile, since it
often suffices to do time-consuming optimization on only one-tenth or
one-twentieth of a program.

Although Knuth made this pronouncement twenty years ago,
and although programmers routinely ‘optimize’ programs by hand
based on profile data, Knuth’s Dictum (as we will call it) still has not been fully
implemented in an automated profiling system nor shown to be undesirable.

This dissertation examines the issues surrounding the utilization of profile
data in the compilation of source code. This is a larger subject than that of
generating or collecting profile data. It requires asking, at a minimum, the following
questions: Given that profile data exists for a program, how might a compiler make
use of that data to produce better executable code? What kinds of profile data can
be generated or collected? What kinds of profile data are useful? How expensive is this
profiling?

We  start  with  two  hypotheses: 

1.  Collecting  profile  data  need  not  be  prohibitively  expensive. 

2.  Compilers  can  profitably  use  profile  data  at  all  levels  of  the  compilation  process. 

Compilers  that  emit  profiling  code  have  been  at  least  partially  implemented  by  most 
modern  systems.  But  no  systems  of  which  I  am  aware  utilize  a  program’s  profile  data 
throughout  the  compilation  process.  Furthermore,  while  these  hypotheses  might  be 
accepted  in  a  general  way,  there  are  still  some  misconceptions  about  the  cost  of 
profiling,  and  room  for  improvement  in  the  profiling  algorithms  themselves. 
1.1 Past uses of profile data

The collection and use of profile data has a long history, beginning with
Knuth’s 1971 paper [28]. In this paper, the term ‘profile’ was first used, and defined
to be the collection of execution frequency counts taken during executions of a
program. Since that time, the term ‘profile data’ has come to mean any quantitative
information gathered about the run-time behavior of a program, including
execution counts of the program and its sub-parts, reference counts of the program’s data
objects, and real-time measures of algorithm executions.

Knuth’s examination of execution profiles of running user programs
uncovered two facts: (1) most programmers do not know where their programs spend most
of the time; and (2) even when programmers analyze their programs, they still don’t
know where their programs spend most of the time, because programmers
almost never have access to sufficient information about system and library functions
to deduce the runtime resources they consume. For example, a major culprit in the
FORTRAN environment in which Knuth did his study was the formatting routines
in the I/O statements.

Another result of Knuth’s study was the rule of thumb that 90%
of a program’s time is spent in 10% of the code, variously called the 90-10 or 80-20
rule. Knuth never used either of these numbers but reported that in his studies 50%
of the time was spent in 4% of the code. In fact, the actual numbers are different
for each program. The 90-10 rule, or whatever you want to call it, is one of the
guiding principles on which manual program optimization is based: find that section
of your program that takes most of the runtime resources, and either modify the
algorithm itself (e.g., change a bubble-sort into a quick-sort) or make the existing
algorithm more efficient at the low level (e.g., hoist common expressions out of loops,
do strength reduction on the index variables, turn repeated array indexing operations
into pointer operations, etc.).
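As an illustration of how such a rule can be checked against an actual profile, the following Python sketch computes the fraction of code units needed to account for a given share of the total cost. The function name `hot_fraction` and the sample numbers are assumptions made for this example; they are not data from Knuth's study.

```python
def hot_fraction(costs, share=0.5):
    """Return the fraction of code units needed to cover `share` of the total cost.

    `costs` holds one measured cost (for example, an execution count or a
    time estimate) per code unit. Units are taken hottest first.
    """
    total = sum(costs)
    running = 0
    for used, cost in enumerate(sorted(costs, reverse=True), start=1):
        running += cost
        if running >= share * total:
            return used / len(costs)
    return 1.0


# Hypothetical profile: 25 code units, one of which dominates the run time.
example_costs = [600] + [20] * 24
```

With these invented numbers, `hot_fraction(example_costs)` is 0.04: a single unit out of 25 covers more than half of the total cost. Real programs will produce different figures, as the text notes.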

After Knuth’s 1971 paper, Dan Ingalls published two papers describing
descendants of the FORDAP profile tool used by Knuth. FORDAP was a basic-block
counting profiler. The first technical report [23] gives details of how the FORTRAN
Execution Time Estimator (FETE) adds execution time estimates to the frequency
count displays of FORDAP. This enhancement was prompted by the obvious fact
that not all statements are created equal. For example, the FORTRAN statement

A  =  B(I) 

will execute at vastly different speeds depending on whether B is an array or a
function. FETE used a value based on ‘weights’ assigned to expression operators
and statement classes to give a rough estimate of the execution time. FETE was not
sophisticated enough to handle calls on user functions. If, in the example above, B
is not an array but a call on a function, FETE will count it as an array reference
unless B is a standard FORTRAN function.
Ingalls’ second paper [22] describes FORTUNE, which is simply a renamed,
product version of FETE. FORTUNE and FETE modified the source program so
that it contained FORTRAN statements incrementing elements of an array of
counters. These counting statements were placed essentially at the beginnings of basic
blocks. The analyzer reported statement execution counts and estimates of execution
time for each statement.
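The weight-based estimate described above can be sketched in a few lines of Python. This is only an illustration of the idea of multiplying a statement's execution count by the summed weights of its operations; the weight table, the operation names, and the function `estimate_time` are hypothetical and do not reproduce the values or the interface of FETE or FORTUNE.

```python
# Hypothetical operator weights in arbitrary time units (not FETE's actual table).
WEIGHTS = {"assign": 1, "add": 1, "mul": 3, "array_ref": 2, "call": 10}


def estimate_time(statements):
    """Estimate total execution time.

    `statements` is a list of (execution_count, operations) pairs, where
    `operations` lists the weighted operations that one execution performs.
    """
    return sum(count * sum(WEIGHTS[op] for op in ops) for count, ops in statements)


# The statement A = B(I), executed 1000 times under two interpretations of B.
program = [
    (1000, ["assign", "array_ref"]),  # B is an array
    (1000, ["assign", "call"]),       # B is a function
]
per_statement = [count * sum(WEIGHTS[op] for op in ops) for count, ops in program]
```

Under these assumed weights the two readings of the same statement receive estimates of 3000 and 11000 units, which shows why an execution count alone says little about where time is spent.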

Prof [5], a profile collector for C, Pascal, and FORTRAN programs on
the UNIX system, is an example of profiling tools in current use. Prof samples the
program counter via timer interrupts to estimate the amount of time spent between
the symbols of the program. Prof’s usability has been enhanced by Gprof, a program
developed at Berkeley by Graham, Kessler and McKusick [16]. Gprof explicitly
concentrates on procedure calls, providing both frequencies for calls through counting
and estimates of the time spent in each procedure. Timing estimates are derived by
sampling the program counter, as in prof. After the user’s program has run, a
separately invoked post-pass analyzer distributes timing estimates to the program’s
procedures based on the static and dynamic call graphs.

There has been a small amount of research published about the best way to
profile a program. In the first volume of The Art of Computer Programming, Knuth gives
the algorithm for determining a minimal instrumentation of a program for collecting
the execution counts of arcs. Knuth and Stevenson [30] (about which more will be
said later) published the definitive algorithm for finding a minimal instrumentation
of a program that counts the execution of its basic blocks. Cheung [7] concurrently
developed algorithms for finding minimal instrumentations that count the frequency
of execution paths through a program. A paper by Sarkar [42], without referencing
this body of work, developed an algorithm for instrumenting a program based on its
dependence graph.

There has been some research into the potential uses of profile data. Gilbert
Hansen’s research [17] is an early investigation into behavior-driven optimization. He
hypothesized that, for certain classes of software, the optimization of a program could
be done at run-time more economically than at compile-time. Instead of the usual
compile-a-file paradigm that most compiler systems utilize, Hansen’s “adaptive”
compiler consisted of two “phases”. The first phase generated an interpretable form
of a FORTRAN program in a fast, one-pass compilation (it produced ‘quads’ as the
interpretable form). The second phase consisted of an interpreter and optimizer loaded
with the compiled program. When the interpreter detected that a basic block was
being executed sufficiently often, interpretation was suspended while the optimizer
was invoked to compile the basic block to a lower level. If a basic block were executed
often enough, it would eventually be compiled down to machine language.

The one-pass compiler annotated each basic block with information about
its size and complexity so the interpreter could predict profitable optimizations. The
system was designed to expend effort only on optimizations that had a high
probability of paying for themselves through improved execution of the program. There
were four levels of optimizations, three of which were performed on the interpretable
form (constant folding, common subexpression elimination, and moving invariants
out of loops), and the final of which compiled the quads resulting from the first three
optimizations into machine language, an ‘optimization’ he called ‘fusion’. Therefore,
the optimizer would be invoked up to four times on a basic block, each time
optimizing it further. When the basic block had been optimized as much as possible in its
interpreted form, the final optimization would compile it to machine language. At
this time, several machine-dependent optimizations would also be applied and
execution would become ‘threaded’: i.e. a mixture of interpretation and direct execution
in which the program has been ‘fused’ into the interpreter.
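The staged promotion of a basic block can be pictured with the following Python sketch. It is a deliberately simplified model, not Hansen's implementation: the fixed count thresholds, the ordering of the three interpretable-form optimizations, and the `Block` class are all assumptions made for illustration, whereas the system described above also used size and complexity annotations to predict which optimizations would pay off.

```python
# Optimization levels in the order listed in the text; level 0 is plain interpretation.
LEVELS = [
    "interpreted",
    "constant folding",
    "common subexpression elimination",
    "loop-invariant motion",
    "fusion (machine code)",
]

# Hypothetical execution counts at which a block is promoted to the next level.
THRESHOLDS = [10, 100, 1000, 10000]


class Block:
    """A basic block that is re-optimized as its execution count grows."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.level = 0

    def execute(self):
        """Count one execution and promote the block if it crossed a threshold."""
        self.count += 1
        if self.level < len(THRESHOLDS) and self.count >= THRESHOLDS[self.level]:
            self.level += 1  # the optimizer would be invoked here
        return LEVELS[self.level]
```

A block executed only a handful of times stays interpreted and costs nothing beyond the one-pass compilation, while a block in a hot loop is promoted at most four times and ends up as machine code.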

It is not surprising that Hansen’s one-pass compiler executes more quickly
than optimizing, multi-pass compilers. What is surprising is that those initial savings
were almost never depleted. That is, the time required to do the one-pass compilation
to ‘quads’, plus the time for interpretation and intermittent optimization was almost
always less than the time required to do a full optimization of the original program
plus the execution time of the optimized program!

Hansen’s system is appropriate for compiling and running throw-away,
run-once programs: in a student programming environment, for example, overall CPU
utilization is decreased. It is not an appropriate system for constructing software designed
to be run many times (e.g. editors, the operating system, the adaptive compiler
itself). The success of this system depended on the ratio of the number of
compilations to the number of executions being very close to one. If a program is executed
many times, it then becomes profitable to compile and optimize the whole program
once.

Hansen’s research lends credence to our contention that profile-driven
optimization is a useful adjunct to ‘traditional’ compilation. If his adaptive compiler
could use heuristics to predict the future behavior of a program successfully enough,
a static compiler using those same heuristics with complete profile data should do
better. The major point is that, whereas the adaptive compiler system is forced to
make the assumption that the performance of the program in the immediate past is
predictive of its performance in the immediate future, a static compiler can
accumulate profile data, smooth out anomalous behavior over several runs of the program,
and make the more accurate assumption that the average past performance of the
program is a good predictor of its average future behavior.
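One simple way to picture the accumulation of profile data over several runs is to average the counts per program element, as in the Python sketch below. The function name `accumulate` and the representation of a profile as a dictionary of counts are assumptions for this example; other smoothing schemes (weighted or median-based, for instance) are equally possible.

```python
from collections import Counter


def accumulate(profiles):
    """Average execution counts over several runs.

    `profiles` is a list of dictionaries mapping a program element (such as
    a basic block or arc name) to its execution count in one run.
    """
    total = Counter()
    for profile in profiles:
        total.update(profile)  # adds counts key by key
    runs = len(profiles)
    return {key: total[key] / runs for key in total}


# Three hypothetical runs; the third exercises block B unusually heavily.
runs = [{"A": 10, "B": 0}, {"A": 12, "B": 2}, {"A": 8, "B": 100}]
average = accumulate(runs)
```

For these invented runs the averages are 10.0 for A and 34.0 for B, so a single atypical run influences the accumulated profile without dominating it the way it would dominate a decision based on that run alone.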

There has been a large amount of research using profile data to improve
virtual memory performance. Most of this work has depended on profile data in the
form of an address trace and has improved program performance by reorganizing the
modules of programs to minimize page faults (Ferrari [12] gives a summary of this
early work). Nearly all of this research has concentrated on post-compilation
module-level reorganization of a program. There have been some techniques developed to
reorganize programs dynamically based on their behavior. For example, K. D. Ryder
[39] and J.-L. Baer and G. Sager [2] used dynamically collected profile data to
allocate physical memory for programs running on virtual systems.

In spite of all the use of profile data to improve program performance in
various ways, modern programmers wishing to improve the performance of their
programs using profile data have limited options. In almost every system that provides
any profiling capability at all, it is still up to the users to analyze the data and
manually reorganize or rewrite their programs based on that data.

1.2  Research  Contributions 

This  dissertation  presents  several  results  related  to  the  use  of  profile  data, 
ranging  from  exactly  how  profile  data  should  be  collected,  to  actual  uses  of  profile 
data  in  languages  and  their  compilers. 

Since at least the mid-seventies the problem of efficient profile collection
via code instrumentation has been considered a solved problem. This research
discovered problems with the solutions, and results are presented here that show that
old assumptions about profiling are incorrect. Specifically, it is not the case that
counting the execution frequencies of transfers of control in a program is expensive:
it is often cheaper than simply counting execution frequencies of the basic blocks in
a program. Also, I show that the algorithms that have heretofore been considered
‘optimal’ are in fact not optimal, and that optimality is difficult to achieve. In
Chapter 2, I present an algorithm that finds better instrumentations of programs.

A difficult problem from past research has been improving the performance
of a program in a paging environment. While paging is no longer the issue that it once
was, caching in a memory hierarchy has taken its place. In Chapter 3 I demonstrate
that a dramatic percentage of the improvement in the instruction-cache behavior of
a program can be obtained by reorganizing a small percentage of the actual code.

Users  should  be  able  to  declare  a  variable  to  be  of  some  abstract  type 
without  worrying  about  the  implementation.  Unfortunately,  it  has  turned  out  to 
be  very  difficult  to  design  an  efficient  system  with  this  capability.  In  Chapter  4  I 
present  the  design  of  a  system  which  uses  profile  data  to  assign  implementations  to 
variables.  With  appropriate  language  extensions  that  allow  the  writer  of  alternative 
implementations  to  specify  what  kinds  of  profile  data  are  needed  and  how  it  is  to 
be  evaluated,  the  user  can  declare  variables  to  be  of  a  generic  type  (e.g.  Set(int)) 
and  let  the  system  decide,  based  on  the  profile  data,  which  implementation  to  use 
for  the  variable. 
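The idea of letting profile data choose an implementation can be sketched as follows in Python. This is not the TYPESETTER design, which is presented in Chapter 4; the two implementation names, the per-operation cost figures, and the function `select_implementation` are hypothetical and serve only to show the kind of decision involved.

```python
# Hypothetical per-operation costs for two implementations of Set(int).
COSTS = {
    "bit_vector": {"insert": 1, "member": 1, "iterate": 50},
    "linked_list": {"insert": 1, "member": 20, "iterate": 5},
}


def select_implementation(op_counts):
    """Pick the implementation with the lowest estimated total cost.

    `op_counts` maps an operation name to how often the profiled program
    applied it to the variable in question.
    """
    def total(impl):
        return sum(COSTS[impl][op] * n for op, n in op_counts.items())

    return min(COSTS, key=total)
```

With these assumed costs, a variable that is mostly tested for membership would be given the bit-vector implementation, while one that is mostly iterated over would be given the list; the programmer declares `Set(int)` in both cases.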
Chapter 2

Profiling  Techniques 


How  do  I  count  thee? 

Let  me  love  the  ways  .  .  . 

(apologies  to  Ms.  Browning) 

Programmers’ intuitions about the runtime behavior of their programs are
notoriously bad. Profiling counteracts this deficiency by providing objective
measures.

After  a  brief  discussion  of  various  profiling  techniques,  we  will  focus  on  the 
insertion  of  counting  code  in  programs  as  the  technique  of  choice.  We  will  see  that 
the  traditional  solutions  to  the  minimal  instrumentation  problem  are  not  optimal. 
An  algorithm  that  is  more  nearly  optimal  is  presented,  with  evidence  to  show  that 
true  optimality  may  be  quite  difficult  to  achieve. 

2.1  Profiling  techniques 

When designing a profile data collection system, two questions must be
answered: how, and how much? This work explores the use of software
instrumentation to collect profile data. This is in contrast to, say, using specialized hardware to
monitor a system from ‘outside’. Such specialized hardware has been built, but
computer manufacturers that provide it do not provide high-level language programmers
with useful tools that can take advantage of such equipment. Tools such as in-circuit
emulators or digital oscilloscopes are primarily useful to the electrical engineer or, at
best, to the writer of peripheral device drivers. There is no evidence to suggest that
such specialized hardware can provide better data than software instrumentation can
provide for high-level language applications.

There are three ways profile data can be collected with software
instrumentation: monitoring, tracing, and counting. All three methods are discussed in
detail below. Monitoring requires some hardware support, usually in the form of a
countdown-timer interrupt. Tracing refers to recording in memory or on some
external medium the sequence of relevant operations of the program or system. Counting is
implemented with code inserted into the program to increment (an array of) counters
to record the execution frequencies of the program.
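A minimal picture of counting instrumentation, written in Python for brevity, is given below. The counter array, the block numbering, and the instrumented `gcd` function are assumptions made for this illustration; a compiler would insert the equivalent increments automatically rather than by hand.

```python
# One counter per basic block of the instrumented function (hypothetical numbering).
counters = [0] * 3


def gcd(a, b):
    """Euclid's algorithm with a counter increment at the start of each block."""
    counters[0] += 1      # block 0: function entry
    while b != 0:
        counters[1] += 1  # block 1: loop body
        a, b = b, a % b
    counters[2] += 1      # block 2: function exit
    return a
```

After a single call `gcd(48, 18)` the counters hold `[1, 3, 1]`: the function was entered once, the loop body ran three times, and the exit block ran once.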

The answer to the ‘how much’ question depends on the granularity of the
profile data for any particular program. There are basically four granularities in
use: procedure, basic block, statement, and instruction level granularities. There
are many variations on these themes, but this gives us enough framework to discuss
several profiling techniques.

2.1.1  Monitoring 

A histogram of program execution is generated by observing the value of
the PC (program counter) at frequent intervals (hence the alternative name of ‘PC
sampling’). There are three parameters for programmers to specify for monitoring:
the area (i.e. address range) of the program that is to be monitored, the number of data
points to be generated for that address range (the granularity), and the sampling
frequency.

For example, a programmer may specify that addresses in the range from 0x2000
to 0x100000 are to be monitored, and that 65,536 data points are to be generated.
So 0x100000 - 0x2000 = 1,040,384 addresses are divided into 65,536 regions,
meaning that one data point will represent 15 addresses (1,040,384 / 65,536, truncated
to an integer): we say that the measurement granularity is 15. On most machines, more than one instruction can fit in a
range of 15 addresses, particularly when the machine is byte-addressable.
Therefore, information about multiple basic blocks totally contained in a single 15-address
range, or about basic blocks that straddle the boundaries of multiple blocks, will be
fuzzy, at best. The goal is to choose a granularity that does not generate too much
information* and yet captures sufficient information about the program to make
reasonable deductions about its performance.
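The arithmetic of this example, and the mapping from a sampled program counter to a histogram entry, can be written out as a short Python sketch. The names `bucket_for` and `record_sample` are hypothetical, and the proportional mapping used here is one reasonable choice among several, not the scheme of any particular profiler.

```python
LOW, HIGH = 0x2000, 0x100000  # monitored address range from the example
BUCKETS = 65536               # number of data points requested

span = HIGH - LOW             # 1,040,384 addresses
granularity = span // BUCKETS  # 15 (the exact ratio is 15.875)

histogram = [0] * BUCKETS


def bucket_for(pc):
    """Map a sampled program counter to a histogram index, or None if unmonitored."""
    if not LOW <= pc < HIGH:
        return None
    return (pc - LOW) * BUCKETS // span


def record_sample(pc):
    """Called on each timer interrupt with the interrupted program counter."""
    index = bucket_for(pc)
    if index is not None:
        histogram[index] += 1
```

Addresses 0x2000 and 0x200E fall into the same histogram entry under this mapping, so samples from two different basic blocks located there could not be told apart afterwards, which is the fuzziness the paragraph describes.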

Choosing a good sample rate is also fraught with tradeoffs. If the interrupts
occur too infrequently, too much of the program’s behavior will occur between the
interrupts. If interrupts occur too frequently, the program’s execution will be totally
swamped by the interrupt overhead. Finding the proper value is hit-or-miss, and
there are no published statistical studies showing what range of interrupt frequencies
sufficiently capture the behavior of a program.

One problem shared by almost all profiling techniques is that of measuring
system overhead. There is no way that a program in a multiprogramming
environment can reasonably measure the behavior of the operating system activity due to
the executing program. The best that can be hoped for is some measure of the
frequency of the system function calls during the program’s execution.

* Make the granularity one, and the profile data tables will be at least as large as the program
and possibly four times larger than the program.

All monitoring implementations depend on the sequentiality of instruction
execution to extrapolate the statistical samples into information about the program’s
execution. Since there is no single register that can be sampled to derive a profile
of  a  program’s  data  reference  behavior,  it  is  extremely  difficult  to  derive  measures 
of  a  program’s  use  of  data  with  monitoring.  Even  continuous  monitoring  of,  say, 
a  data  bus  (which  would  certainly  demand  hardware  support)  cannot  provide  very 
interesting  information.  For  example,  tracing  the  data  references  in  the  area  of 
memory  devoted  to  the  execution  stack  provides  little  information.  Since  the  stack 
varies  dynamically  as  the  execution  of  the  program  proceeds,  functions  and  their 
associated  data  are  not  guaranteed  to  map  into  the  same  addresses,  nor  is  there  any 
guarantee  that  a  given  sequence  of  addresses  will  always  be  used  by  a  given  function. 
To  understand  the  memory  reference  behavior  of  a  program’s  stack  memory  requires 
some  knowledge  of  the  program’s  dynamic  call  tree,  something  a  simple  memory 
monitor  cannot  provide. 

2.1.2  Tracing 

Very  often,  simply  knowing  how  many  times  a  function  was  called,  or  having 
an  estimate  of  how  much  time  each  function  consumed,  is  not  sufficient.  Questions 
such  as  “how  often  was  event  X  followed  by  event  Y”  turn  out  to  be  important 
in  some  contexts.  In  studies  of  a  program’s  performance  in  a  paging  system,  the 
question  takes  the  form  “What  is  the  sequence  of  page  references?”,  and  similarly 
in  cache  performance  studies  the  question  is  “What  is  the  sequence  of  cache  line 
references?”.  Both  kinds  of  studies  have  traditionally  used  address  traces  (both 
instruction  and  data)  of  actual  program  executions  to  answer  these  questions. 
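Once a trace is available, a question of the form “how often was event X followed by event Y” reduces to counting adjacent pairs. The Python sketch below shows this for a trace held in memory as a list; the function name `follow_counts` and the sample trace are assumptions for illustration, and a real address trace would normally be far too large to process this naively.

```python
from collections import Counter


def follow_counts(trace):
    """Count how often each event is immediately followed by each other event.

    `trace` is a list of events (page numbers, cache lines, block names, ...).
    """
    return Counter(zip(trace, trace[1:]))


# Hypothetical trace of six events.
trace = ["X", "Y", "X", "Z", "X", "Y"]
pairs = follow_counts(trace)
```

For this trace, X is followed by Y twice and by Z once. Frequency counts alone (X three times, Y twice, Z once) would not reveal this ordering information.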

There  are  three  ways  to  gather  traces  (without  special  hardware  assistance): 

1.  Instrument  the  code.  This  is  a  messy  and  laborious  technique,  particularly  if 
the  instrumentation  must  not  interfere  with  the  generated  addresses.  The  trace 
can  be  considered  legitimate  for  most  purposes  only  if  the  recorded  addresses 
are  the  same  as  they  would  be  if  there  were  no  instrumentation  code. 

2. Simulation. Sometimes it is simpler to write a simulator for the machine in
question. Each instruction simulated can then produce a trace of the
instruction’s address, all data references generated by the instruction, and perhaps
other information such as timing estimates.

3.  Single-stepping.  Some  microprocessor  architectures  have  a  single-step  feature 
that  interrupts  a  process  after  each  instruction  executed.  A  separate  process 
(in  a  separate  address  space)  handles  the  interrupt  and  generates  the  trace  of 
address  references. 

Tracing a program usually is quite expensive, causing instrumented
programs to run anywhere from 2 (if you’re lucky) to 10 times slower, depending on
the actual method of tracing and the number of execution features being traced.
Furthermore, tracing can produce tremendous amounts of data. With memories and
programs getting larger, it may take many millions of instructions of trace data to
capture interesting effects. Borg et al. report that some interesting characteristics
of traces were not apparent until several billion instructions had executed [6].
Compaction techniques can alleviate some of the problem, as I have demonstrated [41],
but the management and assimilation of huge amounts of data is always difficult.

Again, even if these problems are solved, the trace of a program on a virtual
memory system is not sufficient for capturing effects introduced by the operating
system. Tracing the operating system alone may not provide the necessary information
either. Tracing a complete system across system calls and interrupts is a gargantuan
task, and produces a prodigious amount of information in a small amount of time.
Agarwal, Sites, and Horowitz instrumented a processor’s micro-code to collect such
system-level traces [1], but it is difficult for programmers to utilize this technique to
gain an understanding of how their programs and the system interact.

2.1.3  Counting 

Many, if not most, compilers now have the ability to insert counting code
into users’ programs and provide very precise information about the number of times
lines, basic blocks, or functions are executed. The actual instrumentation is quite
simple, even for separately compiled units of a program. Since this is my preferred
method of collecting profile data, I discuss it in more detail in the next section.

Counting executions of functions is probably the least useful granularity.
Programmers learn which are the most frequently called functions, but that may
bear very little relation to the location of the most frequently executed inner loop.
Counting lines is not always sufficient since “lines of code” are artifacts of particular
languages and the styles of individual programmers. For example, some programs
written in C can have many basic blocks concealed on a single line of code due
to some programmers’ proclivity for using C’s conditional expression construct in
deeply nested macros. The net result is insufficient information when more than one
basic block is on a line, and superfluous counts when basic blocks span lines.

Any modern compiler should implement some form of counting, and at a
minimum it should include basic block counting. In the next section, however, I
argue that counting execution frequencies of transfers of control (arc counting) is
best.


2.2  Efficient  Counting 

Most compilers that implement profiling via the insertion of code count
lines or, at best, basic blocks. However, there are some applications where basic
block counts are insufficient. Examples include code reorganization to improve the
performance of multi-level memory hierarchies [21,35], jump optimization, and code
generation and register assignment [25]. Such optimizations and code transformations
depend on knowing branch probabilities (i.e., arc frequencies) and can make
only imprecise use of block counts. Arc frequencies often cannot be deduced
from block frequencies: an example program graph is in Figure 2.1, where knowing
that each block is executed n times does not provide enough information to determine
the number of times, say, execution of block A is followed by execution of
block D. This is not a contrived example. Also shown in Figure 2.1 is an actual
program flow-graph (PFG) that contains the problematic graph. This flow graph
was generated frequently by a Pascal compiler for while loops.

Figure 2.1: Block counts are insufficient. The sub-graph on the right is the same
as the sub-graph on the left with the addition of an arc from block C to block A.
Knowing the execution counts of the blocks does not allow the derivation of the arc
counts.
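The point that block counts under-determine arc counts can be checked numerically. The following is a minimal sketch on a hypothetical four-block sub-graph (arcs A->C, A->D, B->C, B->D); it is an assumption chosen for illustration and is not claimed to be the graph drawn in Figure 2.1. The function names `profile` and `block_counts` are likewise illustrative.

```python
from collections import Counter


def profile(x, n=100):
    """Arc counts for a hypothetical sub-graph with arcs A->C, A->D, B->C, B->D.

    Blocks A, B, C, and D each execute n times for any 0 <= x <= n.
    """
    return {("A", "C"): x, ("A", "D"): n - x,
            ("B", "C"): n - x, ("B", "D"): x}


def block_counts(arc_counts):
    """Sum the arcs leaving each source block and entering each sink block."""
    leaving, entering = Counter(), Counter()
    for (src, snk), count in arc_counts.items():
        leaving[src] += count
        entering[snk] += count
    return dict(leaving), dict(entering)


skewed, even = profile(90), profile(50)
print(block_counts(skewed) == block_counts(even))  # identical block counts
print(skewed[("A", "D")], even[("A", "D")])        # different arc counts
```

Two runs with very different branch behavior produce the same block counts, so no amount of post-processing of block counts alone can tell them apart.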

Therefore, arc frequencies, not just block frequencies, are desired, although
historically block frequencies have been considered more desirable. To elaborate further
on the problem of minimal instrumentation for arc counts, we will need some
definitions. We assume that the execution cost of inserted instrumentation code
is constant and non-zero; call this cost $K_I$. In Section 2.5, we will discuss some
subtleties in instrumentation code, but for the moment we will assume that each
instance of instrumentation code is exactly the same (e.g., a memory-to-memory increment
operation, or an equivalent sequence of operations). When instrumentation
is inserted in code, it may be necessary to insert a jump instruction to maintain
the semantics of the code. We will assume that the execution cost of this jump
instruction is also constant, $K_J > 0$.

A program flowgraph is

• a set $V$ of basic blocks, the vertices of the graph;

• a set $E$ of directed edges, which are pairs of vertices, $e = (src(e), snk(e))$. We
say that the edge e leaves src(e) and enters snk(e), the source and sink of the
edge, respectively. The edge is an exit arc of src(e) and an entrance arc of
snk(e).

• a distinguished edge of the graph, $e_0$; we assume, without loss of generality,
that $snk(e_0)$, the entrance block of the graph, has no other predecessors, and
$src(e_0)$, the exit block of the graph, has no other successors; each flowgraph has
exactly one entrance block and exactly one exit block.

For each edge e of the graph, $F(e)$ is the frequency count of e; $J(e)$ is a
boolean function that is true if e is an out-of-line jump arc, and false if e is a
fall-through arc; $C$ is a function that maps edges into instrumentation costs. $C(e)$ is the
cost required to instrument edge e, and depends on $F(e)$, $K_I$, and $K_J$: specifically,
$C(e) = C_I(e) = F(e) K_I$ if it is not necessary to insert a jump instruction, or
$C(e) = C_J(e) = F(e)(K_I + K_J)$ if a jump instruction is required.

Each block v is the sink of at most one fall-through arc, and the source of
at most one fall-through arc. There is no limit on the number of arcs for which a
block is a sink or a source. We define in(v) to be the set of predecessors of v, and
out(v) to be the set of successors of v. We say that an arc e is crowded at its sink if
$|in(snk(e))| > 1$. Likewise, it is crowded at its source if $|out(src(e))| > 1$. If an arc
is crowded both at its source and at its sink, we simply say that it is crowded. We
define the predicates e.crowdedSnk, e.crowdedSrc, and e.crowded on the edge e for
these conditions.

The remainder of this discussion assumes that $K_I > 0$ and $K_J > 0$; all of
the examples we will display assume $K_I = K_J = 1$.

All of the following algorithms take advantage of the fact that, given a
program flow graph, all execution frequencies of the arcs and nodes can be derived if
we know an appropriate $|E| - |V| + 2$ arc frequencies. Such a subset of arcs can be
selected by finding those arcs that form the complement of a (non-directed) spanning
tree in the flow graph. If the arcs have associated costs, then a minimal (maximal)
cost subset of arcs can be found by taking those arcs that form the complement of a
maximal (minimal) spanning tree in the program flow graph; a proof can be found
in Knuth, Vol. 1, page 368 [29]. I refer to the algorithm for finding a spanning tree
as the SPAN algorithm, and to the algorithm for finding the minimal (maximal)
spanning tree as MINSPAN (MAXSPAN). All of this, with the exception of the
arc characterization function J and slight differences in notation, is consistent with
previous work.
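The spanning-tree idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the SPAN or MAXSPAN implementation used in this work: it uses Kruskal's algorithm (my choice), treats the distinguished arc as an ordinary arc so that flow is conserved at every block, assumes no parallel arcs, and uses a hypothetical while-loop graph with made-up frequency estimates. The names `max_spanning_tree`, `arcs_to_instrument`, and `recover_counts` are illustrative.

```python
def max_spanning_tree(nodes, arcs, cost):
    """Kruskal's algorithm on the undirected view of the flow graph."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = set()
    for arc in sorted(arcs, key=lambda a: cost[a], reverse=True):
        a, b = find(arc[0]), find(arc[1])
        if a != b:                 # arc joins two components: keep it in the tree
            parent[a] = b
            tree.add(arc)
    return tree


def arcs_to_instrument(nodes, arcs, cost):
    """Arcs in the complement of a maximal spanning tree."""
    tree = max_spanning_tree(nodes, arcs, cost)
    return [arc for arc in arcs if arc not in tree]


def recover_counts(nodes, arcs, measured):
    """Derive every arc count from the measured ones by flow conservation."""
    known = dict(measured)
    while len(known) < len(arcs):
        progress = False
        for v in nodes:
            unknown = [a for a in arcs if v in a and a not in known]
            if len(unknown) != 1:
                continue
            arc = unknown[0]
            inflow = sum(known[a] for a in arcs if a in known and a[1] == v)
            outflow = sum(known[a] for a in arcs if a in known and a[0] == v)
            # Everything that enters v must leave it again.
            known[arc] = outflow - inflow if arc[1] == v else inflow - outflow
            progress = True
        if not progress:
            raise ValueError("measured arcs do not determine the remaining counts")
    return known


nodes = ["entry", "test", "body", "exit"]
arcs = [("entry", "test"), ("test", "body"), ("body", "test"),
        ("test", "exit"), ("exit", "entry")]  # last arc plays the distinguished arc
estimate = {arcs[0]: 1, arcs[1]: 10, arcs[2]: 10, arcs[3]: 1, arcs[4]: 1}

chosen = arcs_to_instrument(nodes, arcs, estimate)
# Pretend the instrumented run executed the loop body seven times.
measured = {("body", "test"): 7, ("exit", "entry"): 1}
print(chosen)
print(recover_counts(nodes, arcs, measured))
```

Only two of the five arcs carry counters; the other three counts follow from conservation of flow at each block, peeling the spanning tree from its leaves inward.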

2.2.1  Basic  block  counts

The most commonly implemented instrumentation technique counts the
number of executions of every basic block, which we’ll refer to as FULLNODE.
Techniques that count every line or every programming language statement are even
more inefficient variants of this technique. FULLNODE has the advantage of being
the easiest to implement, but the disadvantage of being inefficient: a program can
easily be slowed down by as much as 50% to 100%, depending on the execution cost
of the instrumentation and the average size of a program’s basic blocks.

However,  it  is  not  necessary  to  instrument  each  and  every  basic  block  to  get 
complete  block  counts.  Knuth  and  Stevenson  [30]  and  Cheung  [7]  present  algorithms 
that  compute  a  minimal  subset  of  basic  blocks  that  when  instrumented  provide 
sufficient  data  to  recover  the  execution  frequencies  of  all  other  basic  blocks.  (Cheung 
also  contains  a  much  more  detailed  study  of  minimal  instrumentation,  including 
minimal  instrumentation  for  determining  path  coverage.)  A  TYPESETTER  version 
of  Knuth  and  Stevenson’s  algorithm  is  included  in  Appendix  A.  I  will  refer  to  this 
algorithm  as  the  K-S  instrumentation  algorithm. 

Conceptually, the K-S algorithm is a graph transformation followed by an
application of a spanning tree algorithm. Given a program-flow graph with basic
blocks $V$ and edges $E$, the relation $\equiv$ between basic blocks is defined to be the
smallest equivalence relation such that $a \equiv b$ if there exist a vertex $c$ and arcs $c \to a$
and $c \to b$. A reduced graph is produced whose vertices $V_r$ are the equivalence classes
of the original graph, and whose edges correspond one-to-one with the basic blocks
of the original graph; for each basic block $b \in V$, there exists an edge in $E_r$ from
the equivalence class containing $b$ to the class containing the successors of $b$ (by
construction, they are all in the same class).

SPAN is applied to the reduced graph to find a spanning tree, and those
edges of $E_r$ not in the spanning tree specify the nodes in $V$ that need to be instrumented.
The K-S algorithm also computes the expressions that will later allow the
frequencies of all nodes to be computed in one pass.

If we have some notion of the execution behavior of the program, then each
node $v$ of a program flow graph can be assigned an instrumentation cost, $K_I F(v)$. A
minimal cost instrumentation of the program flow-graph can be found by following
the steps for the K-S algorithm, but using MAXSPAN to find the spanning tree
rather than SPAN. We’ll call the minimum cost algorithm MINNODE.
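The graph transformation at the heart of the K-S approach can be sketched as follows. This is only the reduction step, written under my own assumptions: the flow graph is given as a successor map in which every block has at least one successor (the distinguished arc guarantees this), and the function name `reduced_graph` and the example graph are hypothetical. It is not the TYPESETTER version in Appendix A.

```python
def reduced_graph(succ):
    """Return (equivalence classes, one reduced edge per basic block).

    Two blocks are equivalent when some block has arcs to both of them;
    the classes are the closure of that relation, built with union-find.
    """
    parent = {b: b for b in succ}

    def find(b):
        while parent[b] != b:
            b = parent[b]
        return b

    # All successors of the same block fall into one class.
    for targets in succ.values():
        for t in targets[1:]:
            parent[find(t)] = find(targets[0])

    members = {}
    for b in succ:
        members.setdefault(find(b), set()).add(b)
    cls = {b: frozenset(members[find(b)]) for b in succ}

    # Block b becomes an edge from its own class to the class of its successors.
    edges = {b: (cls[b], cls[succ[b][0]]) for b in succ}
    return set(cls.values()), edges


# Hypothetical while loop; the arc exit -> entry plays the distinguished arc.
succ = {"entry": ["test"], "test": ["body", "exit"],
        "body": ["test"], "exit": ["entry"]}
classes, edges = reduced_graph(succ)
print(len(classes), len(edges))
```

Here the reduced graph has three vertices and four edges, so any spanning tree leaves two edges out, and the two basic blocks those edges stand for are the ones to instrument.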

2.2.2  Arc  counts 

A naive implementation of arc counting in a program flow graph would
be to instrument each and every arc of the original graph. This would provide
the necessary counts, but rather expensively. The cost comes from two facts: (1)
$|E| > |V|$, meaning more space would be required by the instrumentation code;
and, (2) to instrument some arcs requires the creation of a new basic block, and the
addition of a jump instruction. For such arcs, the instrumentation cost would be
$K_I + K_J$, whereas the instrumentation cost for all nodes is simply $K_I$. These facts
have contributed to a commonly held belief that arc counting is too expensive and
complicated for practicality. As we shall see, this simply is not the case.

As in the case of basic block counts, it is not necessary to measure each
and every arc to derive the number of times each was traversed. A minimum set
of arcs to be measured in a graph consists of those arcs not in the spanning tree
constructed by SPAN, and the minimal cost instrumentation is the set of arcs that
form the complement of the spanning tree found by MAXSPAN. We will call the
resulting algorithm MINARC.

As it stands, MINARC is too expensive. As noted above, instrumenting an
arc requires the creation of a new basic block that is inserted in the arc between its
source and sink nodes. For jump arcs, this new basic block must also contain a jump
to the original target of the arc: this jump adds to the cost of the instrumentation.
This cost is often ameliorated by using transformations to turn arc measurements
into node measurements wherever possible. For instance, if an edge that is to be
instrumented represents the fall-through of one basic block into another, and the
edge is the only edge leaving the source block, then the instrumentation can be
inserted in the edge’s source block; similarly, if the edge is the only edge entering the
target block, then the instrumentation can be inserted in the edge’s target block.

Procedure 1 implements a heuristic algorithm using these transformations
to reduce the cost of profiling arc traversals, where

ISINK(e) = Add instrumentation code to the front of the block snk(e).

ISOURCE(e) = Add instrumentation code to the front of src(e).

and

ISPLIT(e) = Replace edge e with a new basic block $v_e$ and edges $e_1 = (src(e), v_e)$ and
$e_2 = (v_e, snk(e))$. Furthermore, $J(e_1) = J(e_2) = J(e)$, and $F(e_1) = F(e_2) = F(e)$.
(As we will see, determining $C(e_1)$ and $C(e_2)$ is problematic.)

These transformations allow us to keep the problem relatively simple. If
we added transformations that allowed us, say, to move basic blocks to remove jump
instructions, or to make the most frequent arc out of a block the fall-through arc
from that block, the problem would be much more complicated. To keep the problem
simple, we assume that the linear order of basic blocks in memory will remain the
same throughout the process of instrumenting the program. Our only options are to
determine where to add instrumentation code.

Before the MAXSPAN algorithm can be applied to the program flow graph
to find which arcs are to be instrumented, the cost of instrumenting each arc must
be estimated. If at all possible, we would like to insert instrumentation code without
introducing extra control flow logic or extra basic blocks, so if we can identify where
the above transformations can be applied, we can more accurately estimate the
cost of instrumenting an arc. This leads to the following heuristic cost estimation
algorithm.

Procedure 1 Instrumenting an arc:

Input: An arc that is to be instrumented.

Result: Instrumentation code is added to the PFG to count executions of the arc.

Method:

    Instrument(Arc e)
    {
        if not e.crowdedSrc then ISOURCE(e);
        elsif not e.crowdedSnk then ISINK(e);
        else ISPLIT(e);
        endif;
    }

□


Function 2 Estimate cost of instrumenting an arc:

Input: An arc in the PFG.

Output: An estimate of the cost of instrumenting the arc should it be chosen for
instrumentation.

Method:

    InstrumentationCost(Arc e)
    {
        if not e.crowded then
            return K_I × F(e);
        elsif not J(e) then
            return K_I × F(e);
        else return (K_I + K_J) × F(e);
        endif;
    }

□
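Procedure 1 and Function 2 translate almost directly into executable form. The sketch below is an illustration only: the `Arc` record, the helper names, the assumption that the graph has no parallel arcs, and the arcs and frequencies in the example are all hypothetical, and the placement is returned as a label rather than performed on real code.

```python
from dataclasses import dataclass

K_I = 1  # cost of one execution of the counting code (assumed value)
K_J = 1  # cost of one extra jump instruction (assumed value)


@dataclass(frozen=True)
class Arc:
    src: str
    snk: str
    freq: int      # F(e)
    is_jump: bool  # J(e): True for a jump arc, False for a fall-through arc


def crowded_src(e, arcs):
    """True when the source block of e has more than one exit arc."""
    return sum(a.src == e.src for a in arcs) > 1


def crowded_snk(e, arcs):
    """True when the sink block of e has more than one entrance arc."""
    return sum(a.snk == e.snk for a in arcs) > 1


def placement(e, arcs):
    """Procedure 1: choose where the counting code for arc e goes."""
    if not crowded_src(e, arcs):
        return "ISOURCE"
    if not crowded_snk(e, arcs):
        return "ISINK"
    return "ISPLIT"


def instrumentation_cost(e, arcs):
    """Function 2: estimated cost of instrumenting arc e."""
    crowded = crowded_src(e, arcs) and crowded_snk(e, arcs)
    if not crowded or not e.is_jump:
        return K_I * e.freq
    return (K_I + K_J) * e.freq


a_c = Arc("A", "C", 200, True)
a_x = Arc("A", "X", 50, False)
b_c = Arc("B", "C", 90, True)
b_y = Arc("B", "Y", 10, False)
c_z = Arc("C", "Z", 290, False)
graph = [a_c, a_x, b_c, b_y, c_z]

for arc in graph:
    print(arc.src, arc.snk, placement(arc, graph), instrumentation_cost(arc, graph))
```

Only the two crowded jump arcs into C are charged the extra jump; every other arc can be counted inside an existing block or on a fall-through path.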

If the instrumentation code can be appended to a basic block, then the cost
of instrumenting the arc e is the cost of the instrumentation code itself, $K_I$, times
the frequency of execution $F(e)$. If the arc is a fall-through arc then the cost is still
$K_I F(e)$ even if the arc is crowded: the instrumentation can be inserted between the
two blocks. In all other cases, the source block of the arc will be jumping to the
instrumentation code, which will itself have to jump to the sink block; therefore,
the cost of instrumenting the arc is $(K_I + K_J) F(e)$, where $K_J$ is the cost of the
jump instruction. We will call MINARC augmented with the heuristic placement
algorithm MINARC'.

Figure 2.2: Another graph with cheaper instrumentation.

This approach looks good, and I will present results below to show that it
is effective, but there is nothing to suggest that it is complete or produces minimal
instrumentations. Rather, it is a set of ad hoc rules for utilizing the results of the
MINARC algorithm. That it is not complete can be seen from Figure 2.2^. If arc
c is to be instrumented, and if the execution frequencies of a or b can be derived
independently of c, then the instrumentation on c can be moved into the node A.
The frequency of arc c is then $F(c) = F(A) - F(a)$, since $F(a) = F(b)$.

This  suggests  another  way  of  asking  the  question.  We  have  an  algorithm 
that  will  find  the  minimum  set  of  nodes  for  computing  node  counts,  and  an  algorithm 
for  finding  a  minimum  set  of  arcs  for  computing  arc  (and  therefore  node)  counts:  is 
there  an  algorithm  that  will  find  a  minimum  set  of  nodes  and  arcs  for  computing 
execution  frequencies  for  a  program  flow  graph? 

An extension to the above algorithms produces a candidate algorithm,
which I’ll (presumptuously and inaccurately) call OPT. Given a flow graph $(V, E)$,
we construct a new one $(V', E')$ as follows. For each node $v \in V$ we create a corresponding
node $v' \in V'$. For each arc $e = v_1 \to v_2 \in E$, we construct a basic block
$v_e' \in V'$ and two arcs $v_1' \to v_e'$ and $v_e' \to v_2'$ in $E'$. So $|V'| = |V| + |E|$ and $|E'| = 2|E|$.
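The construction of the new graph can be sketched as below. This is an illustration under my own representation choices: blocks are strings, arcs are pairs, and the block that stands for an arc is simply named by that pair. The function name `split_arcs` is hypothetical.

```python
def split_arcs(nodes, arcs):
    """Build (V', E') by replacing every arc with a block and two arcs."""
    new_nodes = list(nodes)
    new_arcs = []
    for src, snk in arcs:
        mid = (src, snk)              # the block v_e' that stands for arc e
        new_nodes.append(mid)
        new_arcs.append((src, mid))   # v_1' -> v_e'
        new_arcs.append((mid, snk))   # v_e' -> v_2'
    return new_nodes, new_arcs


nodes = ["A", "B", "C"]
arcs = [("A", "B"), ("A", "C"), ("B", "C")]
new_nodes, new_arcs = split_arcs(nodes, arcs)
print(len(new_nodes), len(new_arcs))
```

Every block that stands for an arc has exactly one entrance arc and one exit arc, so measuring that block is the same as measuring the original arc.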

The K-S algorithm applied to the new graph $(V', E')$ will yield a minimum
set of nodes in $V'$ required to compute all frequencies of all nodes in $V'$. Since each
node selected for instrumentation in $V'$ corresponds to either a node or an arc of the
original graph, we also have a minimum set of nodes and arcs of $(V, E)$ from whose
measurement we can compute all other execution frequencies of nodes and arcs of $(V, E)$.
We extend the algorithm to find the minimal cost set of nodes and arcs by assigning
the same costs to the nodes in $(V', E')$ as are estimated for the nodes and arcs in
$(V, E)$ from which they are constructed. Therefore, if a node $v' \in V'$ corresponds to
node $v \in V$, then $C(v') = C(v)$ (which is always $K_I$ per execution of the node). If a
node $v' \in V'$ corresponds to the arc $e \in E$, then $C(v') = C(e)$. By assigning these costs
and invoking a maximal spanning tree algorithm, we find the minimal cost instrumentation
using nodes and arcs.

^ Thanks to Jim Wilson of Cygnus Corporation for pointing this example out and for taking the
time to convince me it was worth a second look.

    MINARC                    MINOPT
    measured      cost        measured      cost
    n1 → n2        100        n1             100
    n3 → n4         27        n3 → n4         27
    n4 → n5         27        n4 → n5         27
    n5 → n3         54        n5 → n3         54
    n5 → n6         81        n5 → n6         81
    n2 → n3         80        n3              67
    n2 → n4         80        n4              67
    total          449        total          423

Figure 2.3: A program flow-graph for which MINARC is not optimal

That this algorithm, which we’ll call MINOPT, is not equivalent to MINARC',
and can sometimes improve on the measurement costs of a program graph is easily
proved. In Figure 2.2 MINOPT always instruments node A instead of arc c, unless
of course c is executed a lot less than once per execution of A.

Finding PFG instances on which MINOPT produces different instrumentations
than MINARC' is a bit more difficult. Figure 2.3 is the second simplest sub-graph
I have found that demonstrates this difference (the simplest is Figure 2.2). In
Figure 2.3 each arc is labeled with an execution frequency, and with whether it is
a jump arc or a fall-through arc. The two tables show the results of the MINARC'
and MINOPT algorithms. The first column contains the objects chosen to be measured,
and the second column contains the cost of measuring that object (assuming
$K_I = K_J = 1$). The major difference in the results of the two algorithms is that
MINOPT has chosen two nodes to measure, while MINARC' can choose only arcs,
and then look for transformations to decrease the cost. The instrumentation transformations
described previously do not help MINARC' in this example. The only
one that applies is ISOURCE(n1 → n2).

Figure 2.4: The problem configuration

While MINARC' does almost as well as MINOPT, it is heuristic and not
minimal. MINOPT is provably minimal (with respect to a set of instrumentation
cost estimates), and can find instrumentations that would be difficult to characterize
easily as post-transformations for MINARC. Given that the complexity of both
MINARC and MINOPT is $O(|E| \log |E|)$, MINOPT is to be preferred for its simplicity
and minimality.

2.2.3  The  ‘optimal’  algorithms  are  not  optimal 

The assignment of minimal instrumentation costs to the edges of a program
flowgraph has been glossed over in the literature. In fact, such an assignment cannot
always be done unambiguously so as to guarantee a minimal solution. That is to say,
all optimal solutions shown in the literature are optimal with respect to a specific
instrumentation cost assignment on the arcs. But until this time no one has examined
the question of how those costs are assigned, or even if they can be assigned, and
whether such an assignment still permits an efficient optimal solution. For instance,
Cheung mentions that instrumenting some arcs requires extra flow-control instructions
[7, pp. 38-39], but his algorithms assume that these costs can be assigned in
linear time, and that the cost of instrumenting one arc does not affect the cost of
instrumenting other arcs.

That these assumptions do not hold for even very simple cases can be seen
in Figure 2.4 where the arcs to node C are both jump arcs and are both crowded.
The cost assignment algorithm described above would assign instrumentation costs
of $F(A \to C)(K_I + K_J)$ and $F(B \to C)(K_I + K_J)$ to the arcs. However, if either
or both arcs are chosen for instrumentation, it is obvious that one of them does not
have to jump to block C, but rather can create a new basic block that simply falls
through to C. In Figure 2.5 we assume that both arcs going to basic block C are
to be instrumented. The arc $B \to C$ is instrumented by placing its instrumentation
code in a separate basic block which falls through to the basic block C. The
instrumentation for the arc $A \to C$ cannot be so configured and hence must use the more
expensive method for splitting the arc. We can add this transformation to our list
of transformation heuristics on page 13:

Figure 2.5: ISPLIT vs. ISPLITCHEAP

ISPLITCHEAP(e): Replace edge e with a new basic block $v_e$ and edges
$e_1 = (src(e), v_e)$ and $e_2 = (v_e, snk(e))$; $J(e_1) = J(e)$ and $J(e_2) = false$.

If in order to instrument edge e we must use ISPLIT(e), then the instrumentation
cost assigned to edge e must be $C_J(e) = F(e)(K_I + K_J)$. If we use ISPLITCHEAP(e),
then the cost is $C_I(e) = F(e) K_I$.
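A small worked example shows why the cost of one arc depends on what happens to its neighbor. The sketch below borrows the frequencies used later in Section 2.4 (200 traversals of A → C and 90 of B → C, with both constants equal to 1) and assumes, for illustration, that either arc may be the one given the fall-through position in front of C. The function names are hypothetical.

```python
K_I, K_J = 1, 1  # instrumentation and jump costs


def cost_isplit(freq):
    """C_J(e): the new block must jump on to the sink."""
    return freq * (K_I + K_J)


def cost_isplitcheap(freq):
    """C_I(e): the new block falls through into the sink."""
    return freq * K_I


def total_cost(freqs, cheap=None):
    """Cost of instrumenting every arc when arc `cheap` gets the fall-through slot."""
    return sum(cost_isplitcheap(f) if arc == cheap else cost_isplit(f)
               for arc, f in freqs.items())


freqs = {"A->C": 200, "B->C": 90}
print(total_cost(freqs))                 # both arcs split expensively
print(total_cost(freqs, cheap="A->C"))   # the hotter arc falls through
print(total_cost(freqs, cheap="B->C"))   # the cooler arc falls through
```

The three totals differ, so a single per-arc cost fixed before the set of instrumented arcs is known cannot describe all of these outcomes.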

Given the above example, it is easy to see that assigning an accurate
instrumentation cost to an arc when all that is known is that arc’s frequency is not
possible: we do not know until the completion of the algorithm which set of arcs must
be instrumented, and therefore we don’t know whether an arc will need to be ISPLIT
or whether it can be ISPLITCHEAP. Surprisingly, it is not possible to assign correct
instrumentation costs to the arcs in the above situation even when the frequencies
of all arcs entering a node are taken into account. A proof by counter-example is
given below in Section 2.4.

The two-step process of assigning instrumentation costs to arcs and then
applying a maximal spanning tree algorithm does not always produce an optimal
instrumentation. There may be a polynomial-time algorithm for finding an optimal
instrumentation, but as of this writing, I do not know what it is. If there are p
instances of nodes like C in Figure 2.4, then finding the optimal instrumentation
could require examining $2^p$ spanning trees, trying all n cost assignments at each of
the p problem nodes. I conjecture that the problem is NP-complete. A fruitful line
of search for a proof might begin with Szymanski’s NP-completeness proof for the
variable-span branching problem [47], which has some of the same characteristics as
the instrumentation-cost problem.

Even so, finding the optimal minimal-cost solution in practice does not
result in sufficiently more efficient instrumentations to warrant an intense search for
a general solution. MINOPT works quite adequately in practice. The non-optimality
of the solution matters little because

• we show in Section 2.3 that profiling requires less than 20% overhead, anyway;

• the problem configuration arises only under unusual circumstances (it occurred
in only 2.4% of the 20457 basic blocks in our experiments); and,

• of those instances in which it does occur, the performance degradation is estimated
to be less than 1%.

Throughout, whenever I refer to ‘optimal instrumentation’ algorithms, I mean optimal
with respect to the instrumentation estimates.


2.3  Empirical  data 

I created a system that inserts optimal arc-counting in programs and used
it to instrument several C programs by some of the algorithms mentioned in the
previous section. The programs were first instrumented without the benefit of profile
data; the frequencies of all arcs and nodes were one, resulting in a more-or-less
random selection of instrumentation points by the algorithms (‘random’ in the sense
that the selected instrumentation points were the result of vagaries of selecting a
spanning tree from quick-sorting equi-valued elements). Next, a heuristic was used
to assign relative frequencies to arcs: back-arcs and their target nodes were given
higher frequencies than the rest on the assumption that a back-arc indicates a loop.
Finally, profile data was used to compute a minimal-cost instrumentation.
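The back-arc heuristic can be sketched as a depth-first search that flags arcs leading back to a block still on the search stack. This is an illustration under assumptions of my own: the weight of 10 for loop arcs, the function names, and the example graph are hypothetical, and the heuristic actually used in the experiments may have differed in detail.

```python
def back_arcs(succ, entry):
    """Return the arcs whose target is still active in a depth-first search."""
    state = {}
    found = set()

    def visit(v):
        state[v] = "active"
        for w in succ.get(v, []):
            if state.get(w) == "active":
                found.add((v, w))      # arc returns to an enclosing block: a loop
            elif w not in state:
                visit(w)
        state[v] = "done"

    visit(entry)
    return found


def heuristic_frequencies(succ, entry, loop_weight=10):
    """Give back-arcs and their target blocks a higher relative frequency."""
    backs = back_arcs(succ, entry)
    loop_heads = {w for _, w in backs}
    arc_freq = {(v, w): loop_weight if (v, w) in backs else 1
                for v in succ for w in succ[v]}
    node_freq = {v: loop_weight if v in loop_heads else 1 for v in succ}
    return arc_freq, node_freq


succ = {"entry": ["test"], "test": ["body", "exit"], "body": ["test"], "exit": []}
arc_freq, node_freq = heuristic_frequencies(succ, "entry")
print(arc_freq)
print(node_freq)
```

These relative frequencies would then stand in for measured counts when estimating instrumentation costs.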

The results are presented in Table 2.1. Four programs were compiled with
gcc -O and instrumented: intmm, an integer matrix multiply; compress, the UNIX
compression utility; troff, the UNIX typesetting program; and cc1 of the gcc compiler.
The first column for each program shows the running time of the program in
CPU seconds, while the second column shows the overhead of the instrumented
versions of the programs expressed as a percentage of the original running time (more
precisely, if N is the running time of the program without any instrumentation, and
P is the running time of the program with instrumentation inserted, then the second
column = $100 \times ((P/N) - 1)$). All programs were run on a Sun 3/140 and were compiled
with the GNU C compiler, version 1.37.1. All running times are the average of
10 runs to smooth out system-dependent fluctuations.
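As a quick check on how the percentages in Table 2.1 relate to the running times, the overhead can be recomputed as the percentage by which the instrumented time P exceeds the uninstrumented time N. The function name `overhead_percent` is illustrative; the times are taken from the table.

```python
def overhead_percent(n, p):
    """Percentage by which instrumented time p exceeds uninstrumented time n."""
    return 100.0 * (p / n - 1.0)


# (uninstrumented seconds, instrumented seconds) from Table 2.1
samples = {
    "intmm, prof": (29.06, 29.93),
    "compress, FULLNODE": (17.03, 26.78),
    "troff, MINARC' with profile data": (96.91, 117.40),
}
for label, (n, p) in samples.items():
    print(label, round(overhead_percent(n, p), 1))
```

The recomputed values agree with the table entries to the one decimal place reported there.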

The intmm program had no input data, and the contents of the matrices
were initialized the same for each run (whether they were or not would not have
made any difference to the running of the algorithm). The running times shown
reflect the best that the various instrumentations could do for that program based
on the profile data: the profile would be exactly the same each run. MINOPT’s 9.7%
overhead vs. MINARC'’s 14.7% reflects the fact that the inner loop in intmm mirrors
exactly the situation in Figure 2.2. MINOPT was able to find an instrumentation
that did not require the expensive arc-splitting that MINARC' was required to do.

    means                   intmm           compress         troff            cc1
                            sec    ovhd     sec    ovhd      sec     ovhd     sec    ovhd
    not instrumented        29.06           17.03             96.91           28.91
    prof                    29.93   3.0%    18.37   7.9%     127.91  32.0%    36.08  24.8%
    gprof                   31.45   8.2%    21.22  24.6%     164.81  70.1%    46.38  60.4%
    FULLNODE                32.08  10.4%    26.78  57.3%     175.18  80.8%    42.40  46.7%
    MINNODE   random        31.92   9.8%    20.58  20.8%     138.83  43.3%    36.40  25.9%
              heuristic     31.83   9.5%    21.32  25.2%     139.97  44.4%    37.55  29.9%
              profile       31.73   9.2%    19.90  16.9%     130.44  34.6%    35.72  23.6%
    MINARC'   random        33.36  14.8%    20.83  22.3%     131.40  35.6%    36.46  26.1%
              heuristic     33.38  14.9%    20.26  19.0%     131.68  35.9%    37.01  28.0%
              profile       33.32  14.7%    19.94  17.1%     117.40  21.1%    35.21  21.8%
    MINOPT    random        33.31  14.6%    20.39  19.7%     131.24  35.4%    36.35  25.7%
              heuristic     32.93  13.3%    20.15  18.3%     141.56  46.1%    37.85  30.9%
              profile       31.89   9.7%    19.89  16.8%     118.99  22.8%    34.90  20.7%

Table 2.1: Profiling overhead

For each of the other three programs, different input was used to create the
profile data than was used to create the numbers in the table. For compress, the profile
data was generated by compressing compress.c, the source file for the utility. The
numbers in the table are from compressing /usr/dict/words, a 200Kb file containing
a sorted list of 25,144 words. Troff’s profile data was created by typesetting a 48Kb
language reference summary, and the numbers in the table are from typesetting a
190Kb technical report on a bibliographic database browser [49].

The cc1 profile data was created by compiling gcc.c, the 23Kb source file
for the process-dispatching front-end of the gcc compiler, and combine.c, a 46Kb
source file for compile-time constant expression evaluation for the same compiler.
The measured run compiled cccp.c, the 73Kb source file for the GNU C-preprocessor.
The sizes of the source files are after all pre-processing commands and all comments
were stripped.

Using different input for profiling than for timing runs is necessary to convince
us that we aren’t simply ‘training’ the profile algorithms to a specific program.
However, it does introduce some anomalies in the numbers in Table 2.1. For instance,
MINARC', using profile data to instrument troff, resulted in the instrumented program
taking 21.1% longer to run than did the uninstrumented version. This contrasts
with MINOPT using profile data on the same program: the instrumented version of
troff required 22.8% longer (the difference is statistically significant and not due to
variations in measurement). Obviously, the instrumentation selected by MINOPT
does not do as well on the set of troff input as did the instrumentation selected by
MINARC'. Such variation due to variations in the input is to be expected.

                intmm    compress    troff      cc1
    prof        36813       44189    66309   324357
    gprof       20992       25304    42128   204160
    FULLNODE       88        1492    10332    68636
    MINNODE        56         796     6336    48216
    MINARC'        56         808     6408    49040
    MINOPT         56         808     6408    49040

Table 2.2: Comparison of profile data size requirements

The numbers also indicate that the heuristic I used to try to guess which
would be the heavily used arcs and nodes in the PFG is quite inadequate: it is better
simply to accept a random assignment (i.e., if there is no profile data available).
Perhaps a closer analysis of average aggregate program behavior could produce a
heuristic that would do better than just random chance.

From the data presented in Table 2.1 we can see that MINOPT is definitely
competitive with MINNODE, compares favorably with prof and gprof, and
is definitely better than FULLNODE, the traditional technique for instrumenting
programs.

Another benefit of MINOPT is demonstrated in Table 2.2. It shows the
number of bytes required for the generated profile data. In all fairness, comparing
the output of prof and gprof with the others is comparing apples and oranges. You
can’t get execution counts of basic blocks from prof or gprof, but then you can’t get
timing estimates from the others. However, the difference between FULLNODE and
the MIN algorithms is significant.


2.4  Counter-example 

In  general,  it  is  not  possible  to  assign  instrumentation  costs  to  arcs  in  such 
a  way  that  an  optimal  instrumentation  can  be  found  using  MAXSPAN.  To  prove 
this,  it  suffices  to  show  that  there  exists  one  program  flow  graph  for  which  this  is 
true.  To  this  end.  we  construct  a  subgraph  and  show  that  for  any  instrumentation 
cost  assignment  algorithiu,  the  subgraph  can  be  embedded  in  a  larger  graph  that 
causes  MAXSPAN  to  select  a  non-optimal  instrumentation. 

Figure 2.6 shows a portion of a program flow graph that satisfies the criteria.
Basic block C has two entering arcs that are both crowded jump arcs. We more-or-less
arbitrarily assign execution frequencies to the arcs. The arc A → C is executed
200 times, the arc B → C 90 times. We assume that the cost of instrumentation
K_I = 1 and the cost of a jump instruction K_J = 1. The values for these constants
could be any non-zero value and we would still be able to find our counter-example,
but that is not necessary to prove here: we need to show only that there exists one
graph for which the algorithm is non-optimal.

Figure 2.7: Case 1: Estimating that the most frequent arc is split expensively.

The  graph  surrounding  the  sub-graph  in  Figure  2.6  is  not  shown,  but  is 
constructed  such  that  the  MAXSPAN  algorithm  will  put  nodes  A,  B,  and  C  in  the 
spanning tree last. This is easily done by creating the surrounding PFG such that
each  node  has  at  least  one  entrance  or  exit  arc  with  a  frequency  count  higher  than 
any  arcs  in  the  sub-graph.  The  nodes  in  the  sub-graph  are  added  to  the  spanning  tree 
by  selecting  the  arc  in  the  sub-graph  with  the  largest  frequency.  The  arc  is  chosen 
from  among  all  of  the  arcs  shown  in  Figure  2.6,  including  the  entrance  and  exit  arcs 
of  all  three  nodes.  Exactly  three  of  these  arcs  will  be  selected  by  the  algorithm  to 
complete  the  spanning  tree. 
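
The selection step described above can be sketched as a Kruskal-style maximum spanning tree. This is a minimal sketch under stated assumptions, not the MAXSPAN implementation itself: it assumes arc weights are the assigned instrumentation costs, that arc direction is ignored when forming the tree, and that arcs left off the tree are the ones that receive counters. The function name `max_spanning_tree` and the toy graph are hypothetical and do not correspond to any of the figures.

```python
def max_spanning_tree(nodes, arcs):
    """Kruskal-style maximum spanning tree over an undirected view of a flow graph.

    arcs: list of (src, snk, weight). Returns (tree_arcs, measured_arcs);
    arcs left off the tree are the ones that would carry counters.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree, measured = [], []
    # Heaviest arcs first, so the costliest arcs end up on the tree (uncounted).
    for src, snk, w in sorted(arcs, key=lambda a: a[2], reverse=True):
        a, b = find(src), find(snk)
        if a != b:
            parent[a] = b
            tree.append((src, snk, w))
        else:
            measured.append((src, snk, w))
    return tree, measured


# Hypothetical toy graph (not one of the dissertation's figures).
nodes = ["A", "B", "C", "X"]
arcs = [("A", "C", 200), ("B", "C", 90), ("X", "A", 150),
        ("X", "B", 140), ("C", "X", 290)]
tree, measured = max_spanning_tree(nodes, arcs)
measured_cost = sum(w for _, _, w in measured)
```

With four nodes the tree holds three arcs; the two remaining arcs are the ones measured, and `measured_cost` is the quantity the cost assignment is trying to minimize.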

Before invoking MAXSPAN, instrumentation costs must be assigned to
each arc. The question is: does there exist an algorithm for assigning costs to the
arcs A → C and B → C that has as its inputs only the frequencies of those arcs?

Figure 2.8: Cases 2 and 3: Estimating that the least frequent arc is split expensively,
or that both arcs are split expensively.

Figure 2.9: Case 4: Estimating that neither arc is split expensively.

If such an algorithm existed it would produce one of four results: it would
assign the expensive-split cost to A → C, assign it to B → C, assign it to both arcs,
or assign it to neither. Figures 2.7 through 2.9 show that no matter which assignment is
made, there exists a consistent set of arc frequencies which will cause the MAXSPAN
algorithm to misfire and select the wrong arcs for instrumentation, wrong in the
sense of being non-optimal. In each figure, each arc is labeled with its frequency,
and with its cost estimate in parentheses if different from the frequency times K_I.
For example, in Figure 2.7 the arc A → C has frequency 200, but its assigned cost
is K_I + K_J = 2 times that, or 400. Given these instrumentation cost assignments,
the arcs selected by the MAXSPAN algorithm are shown in bold. So in Figure 2.7,
the cost of measuring the sub-graph is 1 + 40 + 51 + 101 + 1 + 289 + 1 = 492. The
dashed lines show a better spanning tree resulting in a cheaper instrumentation: in
Figure 2.7 that cheaper instrumentation would cost 1 + 40 + 51 + 101 + 200 + 1 + 1 = 395.

Figure 2.8 shows a set of frequencies for which guessing that the least
frequent arc (B → C) is expensively split, or that both arcs are expensively split,
will also fail. Again, the bold arcs are the ones chosen by the MAXSPAN algorithm
for inclusion in the spanning tree, while the dashed lines show a better selection, one
resulting in a cheaper instrumentation.

Figure 2.9 shows that assuming both arcs will be ISPLITCHEAP (an impossibility
in actuality) does not work either. The arcs selected by MAXSPAN result
in an instrumentation that costs 1 + 1 + 101 + 200 + 180 + 11 + 1 = 495. If neither
A → C nor B → C is put in the spanning tree, then both will be measured. After
the MAXSPAN algorithm chooses them for instrumentation, it is easy to see that
of the two it is better to ISPLITCHEAP the most heavily used arc; hence the cost
of 180 for instrumenting the lesser executed arc B → C. It would have been better
not to measure B → C, as shown by the dashed lines. This better instrumentation
would cost only 1 + 100 + 1 + 101 + 200 + 11 + 1 = 415.

Therefore,  an  algorithm  for  assigning  instrumentation  estimates  to  arcs  does 
not  exist  that  has  as  its  only  inputs  the  frequencies  of  all  arcs  entering  a  node  and 
that  depends  on  the  MAXSPAN  algorithm  to  find  the  minimum  instrumentation. 


2.5  Problems  with  counting 

There are several complications that must be considered when implementing
a profiling system. The first is determining the point during compilation when
the profiling code is to be inserted. The system I used operated on the assembly
language output of the GNU C compiler. However, if the profiling code is inserted
earlier by the compiler, then it is easy to finesse some of the task of counting. For
instance, simple loops (i.e., reducible sub-graphs with no mid-loop exits) that are
executed a compile-time constant number of times do not need to increment a counter
on each execution of the loop. In general, if the execution frequency of an arc in a
program flow graph is known a priori, that arc can be removed from the program flow
graph prior to computing the set of instrumentation points. If the removal of the arc
results in disconnected sub-graphs, each sub-graph is treated separately. Even when
the loop is not executed a constant number of times, but where strength reduction
can hoist the count increment out of the loop, the loop need not be instrumented.
Rather, the number of executions can be counted outside the loop and the counter
incremented only once. Sarkar [42] shows one method for doing this using dependency
graphs. However, it is impossible to tell from his paper exactly how much
counting overhead is actually reduced by his technique. Further study and better
numbers are needed here.

All of the results in Section 2.3 computed each function’s instrumentation
assuming that the number of times each function was called had to be counted.
That is, the functions were instrumented one at a time without knowledge of how
or from where the function was called^. However, if we have profiled the entire program,
the number of times a function is executed is simply the sum of the execution
frequencies of all call sites that call this function: there is no reason to recompute
that number in the instrumentation of the function. Not all programs execute
synchronously, as we have implicitly assumed throughout this discussion, nor do all call
sites call only one function. If functions are called indirectly, for example by interrupt
handling facilities, then it is mandatory that each function be instrumented
separately to compute the number of times it was called.
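
The whole-program shortcut described above can be illustrated with a small sketch. It assumes every call is direct and that each call site's execution frequency is already known from the profile; the function name `call_counts` and the sample call sites are hypothetical. Functions reached indirectly (for example through interrupt handlers) would still need their own entry counters, as the text notes.

```python
from collections import defaultdict


def call_counts(call_sites):
    """Sum call-site execution frequencies per callee.

    call_sites: iterable of (caller, callee, frequency) for direct calls only.
    """
    counts = defaultdict(int)
    for _caller, callee, freq in call_sites:
        counts[callee] += freq
    return dict(counts)


# Hypothetical profile: three call sites with known frequencies.
sites = [("main", "parse", 3), ("main", "emit", 1), ("parse", "emit", 40)]
counts = call_counts(sites)
```

Here the entry count for `emit` is recovered as 1 + 40 without placing a counter at its entry.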

The  point  to  be  made  here  is  that  these  optimizations  are  possible  only  if 
the  instrumentation  code  is  inserted  prior  to  the  optimizing  passes  of  the  compiler. 

Another  problem  for  post-pass  instrumentation  is  related  to  the  machine 
architecture.  If  the  processor’s  instruction  set  makes  use  of  condition  flags,  and  if  the 
life  of  condition  flag  values  extends  across  basic  blocks,  then  the  profile  code  must 
preserve  those  values.  This  can  be  done  either  by  saving  and  restoring  the  values 
of  the  condition  flags  around  the  instrumentation  code  (the  method  used  in  our 
experiments),  using  an  instrumentation  sequence  that  does  not  change  the  setting  of 
the  condition  flags,  or  by  scanning  the  basic  block  and  inserting  the  instrumentation 
just  before  the  first  instruction  that  kills  but  does  not  use  the  condition  flags  [50]. 
This  situation  is  much  more  naturally  handled  in  the  compiler  proper  than  in  a 
post-processor. 


2.6  Conclusions 

I have demonstrated that counting arcs is as cheap as counting nodes, and
cheaper than counting every node. The MINOPT algorithm produces instrumentations
that take 50-70% of the space required by instrumentations produced by the
FULLNODE algorithm, execute in 70-80% of the time and provide significantly more
information. MINOPT’s instrumentations require approximately the same amount
of time as MINNODE’s and require slightly less space, but, again, they provide
significantly more information. (In the next section I present an optimization that uses
arc counts and would not work with only node counts.) MINOPT should be the
instrumentation algorithm of choice for compilers/systems.

^This method was encouraged by troff, the only widely-used non-interactive program I know of
that uses inter-procedural gotos as a major form of control flow.

Chapter 3

A Low-level Use: Code Reorganization for Instruction Cache Performance

3.1  Instruction  cache  utilization 

If  a  compiler  has  profile  data  available  there  are  simple  optimizations  that 
can  take  advantage  of  the  data.  This  chapter  explores  a  simple  optimization  that 
requires  profile  data.  Specifically,  I  will  show  how  to  utilize  information  about  the 
runtime behavior of a program to enhance the performance of that program on
architectures  with  an  instruction  cache. 

There have been many investigations into improving computer performance
by reorganizing programs’ address spaces on virtual memory machines. In this chapter,
I address the question of whether reorganization can be beneficial for machines
with caches and examine the costs required to achieve improved performance. If an
inexpensive way can be found to reorganize the address space of a program such that
a small cache with code reorganization can have the performance of a larger cache
without reorganization, then smaller, inexpensive caches would be a more competitive
choice.

Instructions  that  are  executed  close  together  in  time  are  temporally  local. 
Instructions  that  are  close  together  in  the  address  space  are  physically  local.  A 
cache  turns  temporal  locality  into  physical  locality  by  holding  the  most  recently 
executed instructions in faster memory. Exactly how a cache should be implemented
in hardware and which strategies should be used for replacing the data in the cache
are topics that have received a great deal of study. For an overview of cache designs and
organizations, see Smith’s survey article [45].

If we let C represent a cache, where each line i of the cache has an address
C(i).addr and contents C(i).instr (the contents is an instruction for instruction
caches), then when address a is referenced, the cache is examined to see if a is
already in the faster memory. If a = C(i).addr for some i, then the contents of that
line is returned as the contents of the referenced memory address.

Figure 3.1: Removing cache contention by reorganizing

A fully-associative instruction cache is one that searches in parallel each
line of the cache for the referenced address; i.e., if C(i).addr = a for some i, then
return C(i).instr. In a direct-mapped instruction cache, on the other hand, the
address and its contents can be in only one line of the cache. Which line is (usually)
determined by the low-order bits of the address; i.e., if C(lower bits of a).addr = a
then return C(lower bits of a).instr. Where in a fully associative cache an address
and its contents can be placed in any line of the cache, in a direct-mapped cache,
an address and its contents can be put in only one place; hence the name direct-mapped.
The hardware required to do the parallel search is expensive to build, while
a direct-mapped cache is much simpler and less expensive.
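
The two lookup rules can be written out directly. The following is a minimal sketch of the model in the text (one address and one instruction per line), not a description of any real cache. The class name `Line`, the two function names, and the use of a modulus to take the low-order bits (which assumes the number of lines is a power of two) are illustrative assumptions.

```python
class Line:
    def __init__(self):
        self.addr = None   # address currently held, or None if the line is empty
        self.instr = None  # contents stored for that address


def lookup_fully_associative(cache, a):
    """Search every line for address a (hardware does this in parallel)."""
    for line in cache:
        if line.addr == a:
            return line.instr
    return None  # miss


def lookup_direct_mapped(cache, a):
    """Only one line can hold address a: the one indexed by its low-order bits."""
    line = cache[a % len(cache)]  # len(cache) assumed to be a power of two
    return line.instr if line.addr == a else None


cache = [Line() for _ in range(4)]
cache[1].addr, cache[1].instr = 5, "add"   # 5 % 4 == 1, valid for both schemes
```

Address 9 also maps to line 1 in the direct-mapped case, so referencing it after address 5 is a miss and would evict the line's current contents.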

In either case, we are interested in several statistics as indicators of how
well a cache performs. A critical statistic is the miss ratio: the fraction of address
references for which the address was not found in the cache. The dual of the miss ratio
is the hit ratio: the fraction of address references for which the address was found in the
cache. By definition, then, miss ratio = 1 − hit ratio.

There  are  only  two  ways  to  improve  the  performance  of  a  program  in  a 
cache:  (1)  decrease  the  probability  that  frequently-executed  sections  of  the  program 
compete  for  cache  resources  (Figure  3.1);  and  (2)  increase  the  amount  of  useful 
information  in  the  cache  (Figure  3.2). 

In Figure 3.1, assume the code at block A and the code at block B are the
active portions of a loop. Assume further that, due to the size of the infrequently
executed block C, blocks A and B are mapped to the same locations in the direct-mapped
cache, as shown on the left. The loop containing these blocks can be made
more efficient by moving A and B with respect to one another so that they do not
conflict in the cache, as shown on the right.

Figure 3.2: Improving cache utilization by reorganizing
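
The contention scenario of Figure 3.1 can be simulated with a few lines of code. This is a sketch under simplifying assumptions: a direct-mapped cache holding one address per line, blocks A and B of equal size, and a loop that executes A and then B on every iteration. The function names, the 8-line cache, and the block addresses are all hypothetical.

```python
def miss_ratio(trace, num_lines):
    """Simulate a direct-mapped cache holding one address per line."""
    lines = [None] * num_lines
    misses = 0
    for a in trace:
        i = a % num_lines          # line selected by the low-order bits
        if lines[i] != a:
            misses += 1
            lines[i] = a           # replace whatever was there
    return misses / len(trace)


def loop_trace(a_start, b_start, size, iterations):
    """Addresses fetched by a loop that runs block A and then block B."""
    body = list(range(a_start, a_start + size)) + list(range(b_start, b_start + size))
    return body * iterations


# Hypothetical layout: 8-line cache, blocks of 4 instructions.
conflicting = loop_trace(0, 8, 4, 10)   # A at 0..3, C at 4..7, B at 8..11
reorganized = loop_trace(0, 4, 4, 10)   # A at 0..3, B at 4..7, C moved away
```

In the conflicting layout A and B evict each other on every iteration, so every fetch misses; after reorganization only the first iteration misses.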

In  Figure  3.2  assume  the  blocks  A,  B,  and  C  are  of  such  a  size  that  A 
and  B  could  fit  in  a  cache  line.  The  cache  lines  on  the  left  represent  one  way  they 
might  map  into  a  direct-mapped  cache,  with  the  side-effect  of  loading  infrequently 
executed  code  from  block  C  into  both  lines.  In  the  cache  on  the  right  the  initial 
cache  miss  that  loads  the  code  from  A  also  loads  B,  saving  at  least  one  cache  miss 
in  the  execution  of  the  loop;  also,  infrequently  executed  code  from  C  takes  up  much 
less  space. 

For  a  fully  associative  cache  there  may  be  ways  of  reorganizing  a  program 
to  improve  its  performance  with  respect  to  (2);  little  can  be  done  as  far  as  (1)  is 
concerned.  With  direct-mapped  caches,  however,  both  (1)  and  (2)  suggest  easy  ways 
to  gain  performance  improvement.  In  a  direct-mapped  cache,  contention  is  a  function 
of  the  addresses  of  the  competing  program  segments,  which  is  easily  controlled  by  a 
loader  and/or  compiler. 

Mark Hill argues that direct-mapped caches are not only cheaper and easier
to build [20], they also can give performance equivalent to that of more complex cache
arrangements for the same silicon acreage invested. I have developed an algorithm
called Greedy Sewing that uses arc counts to reorganize code for improved performance
in direct-mapped caches, and that is independent of the parameters of the
target cache.

While  there  have  been  published  results  for  organizing  data  in  memory  to 
improve  cache  performance,  there  has  been  little  published  regarding  rearranging  the 
instruction space. For instance, Janet Fabri [11] and K. O. Thabit [48] both discuss
methods  for  improving  the  cache  behavior  of  a  program’s  data  accesses,  but  say  little 
about  the  behavior  of  the  code  itself. 

Some recent work on code reorganization includes Scott McFarling’s work
at Stanford [34], work done at Hewlett-Packard by Pettis and Hansen [36], and Hwu
and Chang’s work on combining reorganization with the in-lining of procedures [21].
McFarling’s work differed from mine by concentrating on positioning basic blocks
based on their frequency counts and by utilizing knowledge of the target cache. Hwu
and Chang extended my work by doing actual in-lining (as opposed to my pseudo-inlining).
Pettis and Hansen pretty much duplicated my work, with the exception
that their algorithm reorganizes the entire program instead of just those areas where
the vast majority of the improvement is gained.

3.2  Greedy  Sewing 

A program control-flow digraph (or program flow graph or PFG for short)
is a set of basic blocks V and directed arcs E. If e = v → w for nodes v, w ∈ V
then we define src(e) = v and snk(e) = w. Associated with each arc e (basic
block v) in the graph is a positive integer F(e) (F(v)) representing the number
of times this arc (basic block) was executed during the execution of the program
(so $\sum_{e \in E} F(e) = \sum_{v \in V} F(v)$). Associated with each basic block v are functions
onThread(v) that returns the thread that basic block v is on, onHead(v) that returns
true if the block v is at the head of its thread, and onTail(v) that returns true if
v is at the tail of its thread. Given threads t, s_1, s_2, ... ∈ T, we define the procedure
append(t, s_1, s_2, ...) to concatenate the threads s_i in order onto thread t, and delete
threads s_i from T. The functions first(t) and last(t) return the first and last blocks,
respectively, on the thread t.

The basic idea is to sew threads together such that the order of the basic
blocks in a thread tends to improve the correspondence between the static spatial
locality of basic blocks and their dynamic temporal locality. We define the function
canStitch(u,v) to return true if the nodes u and v can be concatenated onto the same
thread: this is true only if u → v and onTail(u) and onHead(v) are both true.

We  define  the  procedure  Stitch(e)  to  ‘sew’  two  threads  together: 

Procedure  3  Stitching  Basic  Blocks: 

Input:  An  edge  e  in  a  PFG;  a  set  of  threads  of  basic  blocks  T. 

Result:  If  the  source  and  sink  blocks  of  e  can  be  concatenated,  the  set  T  is  modified 
such  that  it  contains  one  less  thread  due  to  the  concatenation  of  the  two  member 
threads. 

Method: 

Stitch(e: Arc)
begin
    if canStitch(src(e), snk(e)) then
        append(onThread(src(e)), onThread(snk(e)));
end


□ 

Using these functions, we are now ready to lay out a preliminary version of
the Greedy Sewing algorithm.

Algorithm 4 Greedy Sewing Algorithm (1):

Input: A PFG (V, E), and a parameter p such that 0 < p < 1; and a set T of threads
that are initialized such that each basic block is on its own thread: after initialization
onHead(v) and onTail(v) are true for all v ∈ V.

Result: The set of threads T is modified to indicate the relative ordering of the basic
blocks in V. A thread t ∈ T specifies the order in which the basic blocks are to
be placed contiguously in memory. There is no implied ordering of basic blocks on
different threads.

Method: The parameter p is used to specify what portion of the arcs will be examined.
That is, a set of arcs A will be processed where for all a ∈ A, F(a) ≥ F(b) for all
b ∉ A, and $\sum_{a \in A} F(a) \le p \times \sum_{e \in E} F(e)$. That is, setting p = .90 would cause the
main loop of the algorithm to be repeated until sufficient arcs had been processed
to account for 90% of all arc traversals. This means that usually 5-10% of the arcs,
and even fewer basic blocks, need be reorganized.

Greedy(p: real; E: Set(Arc))
begin
    assert(0 < p < 1);
    S ← p × Σ_{e ∈ E} F(e)
    s ← 0
    while (s < S) do
        Select e ∈ E such that F(e) is maximum.
        E ← E − {e}
        s ← s + F(e)
        if canStitch(src(e), snk(e)) then
            Stitch(e)
    endwhile
end


□ 
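
The preliminary algorithm can be sketched in runnable form as follows. This is an illustrative sketch, not the code used in the experiments. It assumes that an arc u → v may be stitched only when u is the last block of its thread, v is the first block of a different thread, and that sorting the arcs once by descending count is equivalent to repeatedly selecting the maximum. The function name `greedy_sewing`, the list-based thread representation, and the sample arc counts are hypothetical.

```python
def greedy_sewing(blocks, arcs, p):
    """Preliminary Greedy Sewing: arcs maps (src, snk) -> traversal count F."""
    assert 0 < p < 1
    thread_of = {b: [b] for b in blocks}      # each block starts on its own thread

    def on_tail(b):
        return thread_of[b][-1] == b

    def on_head(b):
        return thread_of[b][0] == b

    def can_stitch(u, v):
        # Assumption: u must end its thread, v must start a different thread.
        return thread_of[u] is not thread_of[v] and on_tail(u) and on_head(v)

    target = p * sum(arcs.values())
    covered = 0
    for (u, v), f in sorted(arcs.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target:
            break
        covered += f
        if can_stitch(u, v):
            t, s = thread_of[u], thread_of[v]
            t.extend(s)                        # append(onThread(u), onThread(v))
            for b in s:
                thread_of[b] = t
    # Collect the distinct threads in first-seen block order.
    threads, seen = [], set()
    for b in blocks:
        t = thread_of[b]
        if id(t) not in seen:
            seen.add(id(t))
            threads.append(t)
    return threads


# Hypothetical if-then-else: A -> B -> D is hot, A -> C -> D is cold.
arcs = {("A", "B"): 90, ("B", "D"): 90, ("A", "C"): 10, ("C", "D"): 10}
threads = greedy_sewing(["A", "B", "C", "D"], arcs, 0.90)
```

On this input the hot path becomes one thread and the cold block is left as a singleton, which is the situation the special cases that follow are meant to improve.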

In preliminary tests of the greedy algorithm, several situations were observed
that this simple algorithm did not handle adequately. While the majority of
program improvement comes from the simple version of Greedy Sewing, it does not
take very many special cases to eat into those savings. Based on observations in these
preliminary runs, the simple algorithm was enhanced with some checks for special
cases. For instance, consider the flow graph in Figure 3.3. If the path A → B → D is
the more frequently executed path, then the thread ABD will be formed (1). Since
C cannot be sewn to either A or D now, it remains a singleton thread (2). If it is
very infrequently executed, then it makes little difference where C is placed relative
to the thread ABD. However, if the path A → C → D is only slightly less frequently
executed than the path A → B → D, and if the sum of the sizes of A, B, C, and D
are small enough that they might all fit in a cache, then the single thread ABCD in
Figure 3.3(3) is preferable over the two threads (1) and (2).

Figure 3.3: Two threads from if-then-else

Since the Greedy Sewing Algorithm is general and does not depend on any
particular cache configuration or size, it cannot know whether any set of basic blocks
will fit in a cache, and so uses a heuristic to attempt to capture instances of this
configuration of basic blocks. The function isSmallEquiCondl checks for basic blocks
matching exactly this configuration (i.e., the basic block ends with a conditional
jump instruction, the two arms of the conditional have at most one basic block in
them and are very nearly equi-probable) and when found the procedure StitchCondl
creates the longer thread. The actual mechanics of putting a small if-then-else on a
thread are straightforward in procedure StitchCondl (Procedure 6).
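
The shape test described above can be sketched as a predicate over a successor map. This is only an illustration of the heuristic under stated assumptions: the representation (`succs` mapping each block to its successor list, `freq` mapping arcs to counts), the function name `is_small_equi_cond`, and the threshold value 0.1 are hypothetical and not taken from the dissertation.

```python
def is_small_equi_cond(e, succs, freq, delta=0.1):
    """succs maps a block to its list of successors; freq maps (src, snk) -> F."""
    src, _ = e
    if len(succs.get(src, [])) != 2:
        return False                       # src does not end in a two-way branch
    arm_e, arm_w = succs[src]
    for arm in (arm_e, arm_w):
        if len(succs.get(arm, [])) != 1:
            return False                   # each arm must have exactly one exit arc
    join_e, join_w = succs[arm_e][0], succs[arm_w][0]
    if join_e != join_w or join_e == src:
        return False                       # arms must rejoin, and not loop back to src
    if join_e in (arm_e, arm_w):
        return False                       # an arm's exit arc must not be a self-loop
    fe, fw = freq[(src, arm_e)], freq[(src, arm_w)]
    return abs(fe - fw) / (fe + fw) < delta   # arms nearly equi-probable


# Hypothetical diamond A -> {B, C} -> D with two different profiles.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
balanced = {("A", "B"): 52, ("A", "C"): 48}
skewed = {("A", "B"): 90, ("A", "C"): 10}
```

With the balanced profile the relative difference is 0.04, so the diamond qualifies for a single thread; with the skewed profile it is 0.8 and the ordinary stitching rule applies.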

A second common, but more complicated, situation is pictured in Figure 3.4
where a procedure P is called from a basic block A. We want to concatenate block
P_r with block B because the same considerations that applied to the previous
if-then-else example apply here: if the frequent path through the procedure is small
enough such that A, the frequently executed portions of P, and B could fit in the
cache, then we would like to construct the thread shown on the right of Figure 3.4.
The procedure StitchCall effectively constructs the arc P_r → B such that P_r and
B are eventually made contiguous. When the bottom of A is sewn to the top of
P_0, we say that procedure P has been pseudo-inlined. A basic block containing a
single jump instruction is inserted between the call and the target to maintain the
semantics of the original code.

That is the simple view of pseudo-inlining. It is complicated by the fact that
at the time we invoke StitchCall on the arc e (using the notation in the example in
Figure 3.4, it will be one of arc A → B or arc A → P_0, depending on the vicissitudes
of the sorting algorithm) we have not yet encountered the return block and may
not encounter it if it is not one of the hot spots of the program. During the normal
operation of Greedy Sewing all blocks ending with a return instruction will end up
at the tail of a thread: there are no exit arcs from a return block. At the same time,
we do not want basic block B, the one following the call instruction, to be threaded
with a less frequently occurring basic block.

Function 5 isSmallEquiCondl:

Input: An arc e ∈ E.

Output: Returns true if src(e) is the head of a small if-then-else with approximately
equi-probable true and false arms. A global constant δ defines what is meant by
equi-probable.

Method:

isSmallEquiCondl(e: Arc) return boolean
begin
    if src(e) ends with a conditional branch (and therefore has
            two exit arcs e and w)
        and snk(e) has one exit arc e_x
        and snk(w) has one exit arc w_x
        and snk(e_x) == snk(w_x)     -- they go to the same block
        and snk(e_x) != src(e)       -- they do not make a loop
        and snk(e_x) != src(e_x)
        and (|F(e) - F(w)| / (F(e) + F(w))) < δ
    then
        return true
    else
        return false
end

□

Procedure 6 StitchCondl:

Input: An arc e ∈ E that is one arm of a basic block that ends in a conditional
branch instruction. Assumes that isSmallEquiCondl(e) is true.

Result: A new thread is added to T that contains the four basic blocks of the
if-then-else.

Method: Let w and e_x be defined as in the function isSmallEquiCondl. Then a new
thread is created by concatenating

    append(onThread(src(e)),
           onThread(snk(e)),
           onThread(snk(w)),
           onThread(snk(e_x))).

□

Figure 3.4: Pseudo-inlining

Let the function target(v) return the basic block that is the target of the
call instruction that ends block v (undefined otherwise), and let returnsTo(v) be the
basic block to which the called procedure returns. Then, whenever an arc e is selected
for which isCall(src(e)) is true (i.e., the basic block src(e) ends with a call instruction),
the procedure StitchCall (Procedure 7) modifies block B = returnsTo(src(e))
so that onHead(B) is false (even though B is still on a singleton thread) and adds
src(e) to a set of remembered basic blocks R. At the end of the Greedy Sewing
Algorithm, the procedure append_R_blocks (Procedure 8) is invoked to append all of
the blocks r ∈ R to the appropriate threads.

The  entire  Greedy  Sewing  Algorithm  used  in  our  experiments  is  given  in 
Algorithm  9. 


3.3  Results 

I used the profile-collection techniques of the previous chapter to collect
arc frequencies of several programs. After profile data was collected, the program
reorgBBs read the original assembly language files and reorganized them based
on that profile.

3.3.1  The  Programs  and  Traces 

There were three programs chosen for experimentation and each program
had four versions created: the normal, unreorganized version produced by the GNU

Procedure 7 StitchCall:

Input: An arc e ∈ E such that isCall(src(e)) is true.

Result: The target of the call in src(e) is pseudo-inlined into the thread containing
src(e). The set of threads T and the set of remembered basic blocks R are modified
so that the pseudo-inlining can be completed later.

Method:

StitchCall(e: Arc)
begin
    assert(isCall(src(e)))
    t ← target(src(e))
    r ← returnsTo(src(e))
    append(onThread(src(e)), onThread(t))
    onHead(r) ← false;
    add src(e) to R
end


□ 


Procedure 8 append_R_blocks:

Input: The set R of remembered basic blocks; the set of threads T.

Result: The threads containing the return blocks are appended to the threads
containing the corresponding call blocks. All return blocks are at the head of a thread,
even though StitchCall modified them to appear otherwise.

Method:

append_R_blocks(R: Set(Node))
begin
    for each r ∈ R do
        append(onThread(r), onThread(returnsTo(r)))
    endfor
end


□

Algorithm 9 Greedy Sewing Algorithm (2):

Input: A PFG (V, E); a parameter p such that 0 < p < 1; and a set of threads T
that are initialized such that each basic block is on its own thread; after initialization
of T, onHead(v) and onTail(v) are true for all v ∈ V.

Result: The set of threads T is modified to indicate the relative ordering of the basic
blocks in V.

Method: The parameter p is used to indirectly specify what portion of the arcs will
be examined.

Greedy(p: real; E: Set(Arc))
begin
    S ← p × Σ_{e ∈ E} F(e)
    s ← 0
    R ← ∅
    while (s < S) do
        Select e ∈ E such that F(e) is maximum.
        E ← E − {e}
        s ← s + F(e)
        if isSmallEquiCondl(e) then
            StitchCondl(e)
        else if isCall(src(e)) then
            StitchCall(e)
        else if canStitch(src(e), snk(e)) then
            Stitch(e)
    endwhile
    append_R_blocks(R);
end

□

           trace length      trace length reorganized
name       unreorganized     p = .80      p = .90      p = .95
SCRUNCH    9,405,156         9,656,437    9,715,946    9,693,277
TROFF      8,059,174         8,343,440    8,325,470    8,338,350
CC1        8,263,593         8,268,313    8,293,376    8,301,226

Table 3.1: Summary of traces: number of instruction words fetched


           number of         number of blocks reorganized
name       basic blocks      p = .80        p = .90         p = .95
SCRUNCH     1,233             22 (1.8%)      27 (2.2%)       32 (2.6%)
TROFF       4,000            149 (3.7%)     207 (5.2%)      318 (8.0%)
CC1        26,407            727 (2.8%)   1,317 (5.0%)    1,939 (7.3%)

Table 3.2: Summary of traces: number of basic blocks


C compiler, and three reorganized by the Greedy Sewing Algorithm with p set to
.80, .90, and .95. A summary of the programs, the basic block counts, and trace
sizes is in Tables 3.1 and 3.2. I collected a trace of each of these twelve programs;
the traces were then used as input to Mark Hill’s DineroIII cache simulation program
[19]. Each trace was simulated on eleven different cache configurations: a 256-byte
cache with 4- and 8-byte blocks; a 1024-byte cache with 4-, 8-, and 16-byte blocks;
and a 4096-byte cache with 4-, 8-, and 16-byte blocks, each of the 4096-byte
configurations using both single associativity (direct-mapped) and two-way set
associativity.
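
To make the measured quantity concrete, the following Python sketch computes a miss rate for an address trace on one cache configuration. It is not DineroIII and makes no claim to match its behavior; the function name miss_rate, the LRU replacement policy, and the representation of a trace as a list of byte addresses are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Iterable


def miss_rate(addresses: Iterable[int], cache_bytes: int,
              block_bytes: int, ways: int = 1) -> float:
    """Fraction of fetches that miss in a set-associative cache (LRU assumed)."""
    n_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = total = 0
    for addr in addresses:
        block = addr // block_bytes       # block number containing the address
        lines = sets[block % n_sets]      # the set this block maps to
        total += 1
        if block in lines:
            lines.move_to_end(block)      # hit: refresh its LRU position
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = True
    return misses / total if total else 0.0


# Two addresses that collide in a 256-byte cache with 4-byte blocks.
conflict_trace = [0, 256] * 4
direct_mapped = miss_rate(conflict_trace, 256, 4, ways=1)
two_way = miss_rate(conflict_trace, 256, 4, ways=2)
```

In this toy trace the two addresses evict each other on every fetch in the direct-mapped cache, while the two-way cache misses only on the first touch of each block.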

The first program was scrunch, a Huffman encoding algorithm. The profile
was generated by scrunching a 200K spelling dictionary. The trace was created by
scrunching scrunch.c, a 42Kb C source file.

A second program, troff, was chosen because of its wide use in UNIX
environments. The profile was generated by troffing three separate technical
documents, chosen to represent a broad and typical use of the program. The first
document consisted of 103K bytes after being preprocessed by tbl, eqn, and grn. This
included 1933 lines (32K bytes) of troff commands, the remainder being plain text.
The other two documents totaled 228K bytes and contained 4004 lines (73K bytes) of
preprocessed commands. The trace was created by troffing a reduced version
of the first document of length 7705 bytes, of which 273 lines (2728 bytes) were troff
commands.

A third program was the Gnu C compiler itself. The profile was collected
while the compiler compiled three Gnu C source files: toplev.c, loop.c, and recog.c.
They totaled 79K bytes, with 20, 12, and 15 C function definitions, respectively. The
trace was collected while compiling genemit.c, a 6Kb file containing nine function
definitions.
Figure 3.5: Scrunch miss rates and percent improvement


3.3.2  Miss  Rates 

Figures 3.5, 3.6, and 3.7 are histograms of the results of running the traces
of the programs through the cache simulator. (Since the reorganization algorithm
didn’t take any cache parameters into account, the same traces are used in all eleven
simulation runs for each program.) They show the miss rates for the four versions
of each program on each of the eleven cache configurations. The leftmost bar of
each group of four is the average miss rate of the unreorganized program. The other
three of the group, from left to right, are the average miss rates for the reorganized
versions for p = 80%, 90%, and 95% respectively. The cache configuration is noted
beneath each group. Figures 3.5, 3.6, and 3.7 also show the improvement in miss
rates of the reorganized versions over the miss rates of the unreorganized versions
(i.e., $1 - m_r/m_u$, where $m_u$ and $m_r$ are the miss rates of the unreorganized
and reorganized versions).

There are instances where reorganization can buy the (miss rate) equivalent
of a larger cache. For example, looking at Figures 3.5, 3.6, and 3.7 we see that the
reorganized programs using a 256-byte cache with 8-byte blocks consistently had
miss rates as good as or better than those of their unreorganized versions running
in a 1K cache with 4-byte blocks.
Figure 3.6: Troff miss rates and percent improvement

[Histogram groups are labeled by cache size, block size, and (where present)
associativity: 256,4  256,8  1K,4  1K,8  1K,16  4K,4  4K,4,2  4K,8  4K,8,2  4K,16  4K,16,2]

Figure 3.7: Cc1 miss rates and percent improvement
3.3.3 Performance improvement

Let $R_u$ be the number of instruction fetches on the original, unreorganized
program, and let $R_r$ be the corresponding number for a reorganized version. For the
sake of these estimates, we will assume $R = R_u = R_r$. We see from Table 3.1 that
this is not strictly true, but they are sufficiently close for our purposes here. Let
$M_u$ be the number of cache misses and $H_u$ the number of hits when the original,
unreorganized program is run, and $M_r$ and $H_r$ be the corresponding values for the
reorganized program. Define the miss rates $m_u = M_u/R$ and $m_r = M_r/R$, with
the corresponding hit rates $h_u = H_u/R$ and $h_r = H_r/R$, and let
$m_\Delta = m_u - m_r$ be the difference in miss rates. Let $t_h$ be the time required to
handle a cache hit, and let $t_m$ be the time required to handle a cache miss. The
running time of the original program, per instruction fetch, is then
$T_u = t_h h_u + t_m m_u$ and the improved running time is $T_r = t_h h_r + t_m m_r$.
Finally, define $f = t_m m_\Delta / T_u$, the fraction of the original program’s time
taken up by cache misses that are turned into cache hits, and $K = t_m/t_h$, the ratio
of the cost to handle a miss to the cost to handle a hit.

Then by Amdahl’s Law:

$$T_r/T_u = (1 - f) + f/K. \qquad (3.1)$$

We can now estimate the improvements in performance from reorganization
taking our example from the specification of the SPUR memory architecture [18].
In general, the cost of a miss is very high on multi-processor, shared-bus systems
due to bus contention or the length of the cache line. SPUR has a 512-byte on-chip
cache and 128Kb off-chip cache. According to Mark Hill, a miss in the on-chip
cache costs three times as much as a hit, assuming the instruction to be in the
off-chip memory cache [20]. Let us assume the on-chip cache shows a normal miss
rate of about 20%, and that we can improve that to 15% by reorganizing. Then
$f = 3 \times .05/(3 \times .20 + .80) = .107$. Plugging this into (3.1) above, we get
$T_r/T_u = .929$, i.e. the program executes in only 92.9% of the time of the original,
a 7.1% improvement. The maximum possible improvement is 28.6% assuming the
unattainable miss rate of 0%.

For the SPUR architecture, an off-chip cache miss will cost 12 to 20 times
that for handling a cache hit. SPUR therefore has a very large mixed cache to combat
this penalty. If we assume that reorganization can reduce SPUR’s miss rate by an
absolute 0.25% (e.g. from 1% to 0.75%), then, assuming K = 17 (a number lifted
from Katz and Eggers [26]), $f = 17 \times .0025/(17 \times .01 + .99) = .0366$. Plugging
this into (3.1) above, we get $T_r/T_u = .966$, a 3.4% improvement in performance.
This is in addition to the performance improvement for the on-chip cache noted above.
With these assumptions, we predict reorganization can improve SPUR’s performance
by about 10%.
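
The two estimates above can be checked with a few lines of Python. This is a sketch of the arithmetic only: the hit time is normalized to 1 so that the miss time equals K, and the function name time_ratio is an illustrative choice, not something taken from the original work.

```python
def time_ratio(miss_before: float, miss_after: float, k: float) -> float:
    """Return T_r / T_u for a miss/hit cost ratio K, with the hit time set to 1."""
    t_u = k * miss_before + (1.0 - miss_before)  # original time per fetch
    f = k * (miss_before - miss_after) / t_u     # share of time spent on removed misses
    return (1.0 - f) + f / k                     # Amdahl's Law, equation (3.1)


on_chip = time_ratio(0.20, 0.15, 3)        # on-chip example: 20% -> 15%, K = 3
off_chip = time_ratio(0.01, 0.0075, 17)    # off-chip example: 1% -> 0.75%, K = 17
best_case = time_ratio(0.20, 0.0, 3)       # unattainable 0% on-chip miss rate
```

Computing the ratio directly as (K m_r + 1 - m_r) / (K m_u + 1 - m_u) gives the same values, which is a quick consistency check on the definition of f.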

This  prediction  is  consistent  with  other  numbers  recently  published  for 
similar  systems.  Pettis  and  Hansen  [36]  at  Hewlett-Packard  Laboratories  report 
10%-26%  improvement  on  a  machine  with  a  16Kb  unified  cache  (the  HP-UX  825). 
When they increased the cache size to 128Kb, the improvements decreased to less
than 10% on the HP-UX 835: all five benchmarks’ improvements averaged 5%, but
with  a  very  wide  range  of  differences  (0.8%  to  9.3%).  Their  study  concentrated  on 
placing  code  to  improve  cache  performance  utilizing  information  about  the  size  of 
the  cache.  They  also  used  arc  counts,  rather  than  node  counts,  but  their  algorithm 
operated  over  the  whole  program,  not  just  the  hot  spots.  They  also  counted  each 
and  every  arc  in  the  flow  graph,  which  may  account  for  the  compiler  going  twice  as 
slowly  when  instrumenting  the  program.  Their  algorithm  for  positioning  the  code 
slowed  the  compiler  down  by  about  15-20%. 

Figures 3.8, 3.9, and 3.10 show the theoretical improvement in execution
performance of the reorganized versions over the original unreorganized versions
when the cost factor K = 2, 3, 4, 5, 6, 7, 10, 15, 20, and 25. Each graph has a
column for each of the eleven cache configurations. A line within a column plots the
expected performance of the indicated program on that cache when reorganized with
the three values p = .80, .90, .95 moving from left to right. K = 2 is the very top
line in each column, and K = 25 is the bottom-most line in each column, yielding a
range in which I expect reorganization to improve the performance of the programs.
So we see that for a 1K cache with 4-byte blocks and K = 2, a version of scrunch
reorganized with p = 80% would take about 97% as long as the unreorganized version
(the left end of the topmost line in the third column from the left). It would take
only about 77% as long if K = 25 (the left end of the bottom line in that column),
and only about 38% as long when K = 25 and p = 95% (the rightmost end of the
bottom line in that column).


3.4  Limitations 

The algorithm for StitchCall is not quite correct, since it does not handle
nested procedure calls correctly. The net effect on the numbers is not at all clear,
but it should be slight in whichever direction it goes. There simply were not that
many sequences of nested procedure calls that could fit in a cache in the code I used
as test cases. This would not be true in languages that encouraged the use of many
small procedures, e.g., C++.

Furthermore, it is not clear whether pseudo-inlining a procedure in only one
location is sufficient. Further tests should be performed to determine whether it is
worthwhile to copy the main thread of execution of a frequently executed basic block
into multiple locations. If I had to guess as to which would have the most effect on
the results (correcting the nested procedure call problem or making multiple copies
of threads of procedure execution), I would say that thread copying would probably
have more effect.

There is a potential problem in isSmallEquiCondl in that it will not handle
correctly a small if-then-else where one of the arms of the conditional is empty, and
the common exit block loops on itself. This situation never arose in my experiments,
and so the potential problem was not detected until this writing.
Figure 3.8: Scrunch performance for K = 2, 3, 4, 5, 6, 7, 10, 15, 20, and 25

Figure 3.9: Troff performance for K = 2, 3, 4, 5, 6, 7, 10, 15, 20, and 25

Figure 3.10: Cc1 performance for K = 2, 3, 4, 5, 6, 7, 10, 15, 20, and 25




I  have  not  considered  many  of  the  parameters  for  designing  a  cache  that 
may  be  relevant.  For  example,  I  have  not  considered  the  increase  in  bus  contention 
caused  by  going  to  a  larger  block  size.  Nor  have  I  considered  alternative  cache 
management  strategies  such  as  sub-block  placement.  My  goal  was  simply  to  explore 
improving  the  performance  of  the  cheapest  of  these  design  alternatives  with  simple 
compiler  enhancements. 


I  have  not  considered  cache  effects  such  as  cold  start  misses  or  cache  flushes 
due  to  system  interrupts  or  context  switches.  I  did  not  consider  any  parameters  of 
the  target  caches  when  reorganizing  code.  It  was  not  clear  at  the  beginning  of  this 
research how much the parameterization of the algorithms by the cache
characteristics would benefit the program’s performance; hence I went for the simple solution
first. 


I haven’t solved the problem of case statements satisfactorily. Currently, it
is possible for the reorganizer to move the code around to such an extent that the
jump table can end up quite a distance away from one or more of its targets. On the
68020, this presents a practical problem since jump tables with half-word pc-relative
entries are much faster than full-word entries. If an item of a case is a “hot spot”, it
is very difficult to relocate it and still satisfy the distance constraint of jump tables.
The only program that gave me real problems was the Gnu C compiler, for which I
generated full-word jump table entries. This does not change any of the miss rate
results significantly, but in real life, it would be an unacceptably slow implementation
due to the slower execution of the table jump.


Finally,  there  are  architectures  that  present  difficulties  for  the  Greedy  Sewing 
algorithm.  An  example  is  the  MIPS-X  instruction  set  [8],  which  has  asymmetric 
conditional  branch  instructions.  Due  to  the  nature  of  the  MIPS-X  pipeline,  each 
conditional  branch  instruction  is  followed  by  two  instructions  that  are  fetched  before 
the  CPU  has  determined  whether  the  branch  will  be  taken.  Each  conditional  branch 
instruction also has a squash bit that, if ‘on’, prevents the execution of these two
instructions if the branch is not taken. Both of these delay slot instructions are always
executed  whenever  the  branch  is  taken:  there  is  no  way  to  squash  their  execution 
when  the  branch  is  taken.  This  makes  sense  if  all  programs  follow  the  pattern  of 
code  generated  by  most  compilers,  where  the  conditional  test  is  at  the  ‘bottom’  of 
the  loop  and  the  conditional  branch  is,  therefore,  almost  always  taken. 
However, the results of Greedy Sewing’s reorganization described here turn
those statistics upside down. The algorithm almost guarantees that the head and tail
of  a  frequently-executed  loop  will  be  made  contiguous,  and  that  execution  will  almost 
always  fall  through  the  conditional  test  to  the  head  of  the  loop:  the  conditional 
branch  is  almost  never  taken.  This  leaves  three  options:  (1)  fill  the  delay  slots 
as  is  currently  done  by  MIPS-X  compilers,  followed  by  a  jump  instruction  to  the 
infrequent  target  (normally  the  loop-exit);  (2)  reverse  the  sense  of  the  conditional 
and  try  to  fill  the  delay  slots  with  instructions  that  don’t  have  to  be  squashed  when 
the  branch  is  not  taken  (because  there  is  no  squash  bit  for  this  direction);  or,  (3) 
punt  and  put  no-ops  in  the  delay  slots. 

Option  (1)  puts  infrequently  executed  instructions  right  in  the  middle  of 
high-frequency  basic  blocks,  working  against  one  of  the  aims  of  reorganization  (better 
cache  utilization  in  high  frequency  code).  Option  (2)  sounds  plausible,  but  it  is 
difficult  to  find  instructions  that  can  always  be  executed  no  matter  which  way  the 
branch goes. It may be possible to generate instructions to undo the effects of
the  delay-slot  instructions  when  the  branch  is  finally  taken,  but  this  begins  to  get 
complicated  and  presents  the  possibility  of  really  slowing  down  a  frequently  executed 
inner  loop.  Option  (3)  is  an  obvious  loss.  McFarling  [34]  implemented  option  (1), 
and  reports  that  the  size  of  repositioned  code  increases  about  14%. 

Because the available MIPS-X compilers filled the delay slots before
emitting assembly language code (which would have required my software to include
a MIPS-X assembly language parser), and because the only profile data I had was the
basic block counts generated by their compiler, applying Greedy Sewing to MIPS-X
code was too far outside the reach of this research.


3.5  Conclusions 

Profile-driven code reorganization definitely improves the performance of
programs. In envisioned programming environments where profile data is a
permanent part of the information manipulated by both programmer and compiler, these
improvements would come simply and cheaply. My experiments have shown
improvements in miss rates on the order of 30% to 50%, and sometimes as high as 50%
to 80%. These figures were obtained by relocating only 3% to 8% of the basic blocks
of typical programs.
Chapter 4

A  High-level  use:  Implementation 
selection  of  abstract  data  types 


In  the  previous  chapter,  I  demonstrated  one  way  in  which  profile  data  could 
be  used  profitably  by  an  optimizing  compiler  at  a  low  level.  Code  reorganization  is 
attractive  since  it  improves  performance  at  a  small  cost,  that  cost  being  the  time  it 
takes  to  decide  the  order  in  which  to  emit  a  few  of  the  basic  blocks  of  a  program. 

In this chapter, we will look at what a compiler might do with profile data
at a very high level. Specifically, I developed TYPESETTER, a system for selecting
implementations for abstract data type representations and functions. By assuming
that profile data exists for a program, we have seen that low-level program
transformations can use very simple algorithms to achieve improvements comparable to
much more complicated algorithms. With TYPESETTER I tested whether comparable
simplifications could be made in selecting implementations of high-level abstract
data types.


4.1  The  Problem 

Before stating the problem, it is useful to differentiate between two classes
of programmers that would make use of TYPESETTER. The User of TYPESETTER
would write a program using only the available abstract data types, and would not
be concerned with how those abstractions were eventually implemented (as long as
the implementations were relatively inexpensive, of course). The Implementor is
the programmer who adds implementations to the TYPESETTER system. I do not
envision that TYPESETTER can be (or should be) an extensible language system at
the User level. If new implementations are to be added, it is an enhancement to
the system, and not simply the shipment of a new library. This difference between
User and Implementor allows us to discuss efficiently the difference between using
an abstraction and implementing it by referring to the specific programmer.

The general problem can be stated simply: what is the most efficient
implementation of a User’s program? This includes selecting the most efficient instruction
sequences as well as the best implementations for the program’s data structures.
Most compilers completely side-step this latter problem by giving the programmer a
specific set of data types with unique implementations and letting the programmer
construct the necessary data structures with the pre-defined data types provided by
the system.

For  example,  there  are  many  ways  sets  of  objects  might  be  implemented: 
linked  lists  of  various  kinds,  bit  maps,  arrays,  trees  organized  by  various  techniques, 
etc. Which of these implementations should be used for a particular program depends
on  the  algorithms  used  in  the  program  and  the  characteristics  of  the  data.  For  the 
most  part,  letting  the  compiler  select  which  implementation  to  use  based  only  on 
static  declarations  has  proven  viable,  but  difficult  and  expensive.  The  problem  is 
difficult  even  when  attention  is  focused  on  a  small  set  of  abstractions,  as  the  SETL 
language  effort  has  shown  [46,51].  Barstow’s  PECOS  system  [3,4]  is  an  attempt  to 
collect  a  database  of  rules  and  heuristics  that  allow  a  programmer’s  specification  of 
a  program  to  be  given  an  implementation.  Elaine  Kant  [24]  extended  the  system  to 
consider  rules  and  heuristics  regarding  the  efficiency  of  various  implementations. 

My  approach  is  to  assume  that  programming  environments  of  the  future 
will  be  collecting  and  maintaining  much  more  information  about  a  program  than 
the  programmer’s  static  declarations.  In  particular,  the  collection  and  utilization  of 
profile  data  will  become  a  matter  of  course.  In  this  chapter,  I  address  the  following 
questions: 

•  Can  profile  data  reduce  the  complexity  of  the  representation  selection  problem 
to  a  level  that  compilers  can  make  such  choices  effectively? 

•  What  information  needs  to  be  collected  by  the  profile  mechanism  so  the  selector 
can  make  effective  choices? 

•  How  much  control  over  the  collection  of  profile  data  can  be  put  in  the  hands 
of  the  designers  and  implementors  of  abstract  data  types? 

•  Is there a general algorithm for selecting representations that works for a wide
variety of abstract data types? That is, can we limit the overall task of the
Implementor to implementation of the ADT, specification of what profile data to
collect, and specification of the runtime resources used by the implementation?

Obviously all these questions are interrelated: for example, the selection
algorithm will influence the kinds of profile data that will be necessary, and
possibly the detail to which the Implementor must go to describe the behavior of an
implementation.

We  are  interested  in  improvements  only  from  data  representation  selection, 
in  contrast  to  algorithm  transformations,  traditional  examples  of  which  include  such 
optimizations  as  code  motion,  finite  differencing,  strength  reduction,  etc.  (Low  [31] 
calls  these  representation  dependent  optimizations.) 
As an example, consider the implementation of:

$S \leftarrow S \cup \{a, b, c\}$

which may be more efficiently implemented as

put a in S
put b in S
put c in S

depending on the representation selected for the set S. Consider also the boolean
expression:

$X \in (S1 \cup S2)$

which is usually much more efficiently computed as

$(X \in S1) \lor (X \in S2)$

For the purposes of this research, we will assume that such optimizations are
discovered by a high-level optimization module. This ignores problems introduced by the
interaction of this ‘higher level’ optimization and our representation selection
mechanism, but allows us to investigate the use of profile data in the selection process.
Assuming the latter is possible, the interaction problem can be investigated later.
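
As a small illustration of the two rewrites just described, the Python sketch below contrasts the literal and the transformed forms using Python's built-in set. The function names are hypothetical, and the built-in set is only a stand-in for whichever representation a selector might choose.

```python
def member_of_union_naive(x, s1: set, s2: set) -> bool:
    """Literal form: build the union, then test membership."""
    return x in (s1 | s2)      # allocates a temporary set


def member_of_union_rewritten(x, s1: set, s2: set) -> bool:
    """Transformed form: two membership tests and no temporary set."""
    return x in s1 or x in s2


def add_literals(s: set, items) -> None:
    """Element-wise form of S <- S union {a, b, c}: put each item in S."""
    for item in items:
        s.add(item)


s = {1, 2}
add_literals(s, [2, 3, 4])
```

Both membership functions return the same answer for every input; the difference lies only in the work done, which is why the better form depends on the representation of the sets involved.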

Unlike Low, we do not want to limit expressions to be homogeneous in
representations. That is, in the expression

$S1 \leftarrow S1 \cup S2$;

if S1 and S2 are two sets, it may be the case that they have two different
representations. We are interested in seeing if there are situations in which it is
profitable to handle the overhead imposed. In this example, either

1.  one of S1 or S2 must be converted to the same representation as the other, or

2.  S1 and S2 are both converted to a third representation, or

3.  a routine to take the union of objects of type itype(S1) and itype(S2) must be
generated by the compiler, or

4.  there must already exist a procedure that can explicitly handle the union of
these two representations.

Larry Rowe [38] explored the problem of generating implementations and we will
not pursue it in this study. The TYPESETTER prototype requires that the explicit
function must exist (corresponding to the fourth option above).
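
To picture the fourth option, here is a hedged Python sketch of one explicit mixed-representation routine: a union of a bitmap-represented set with a list-represented set. The two representations, the function name union_bitmap_list, and the restriction to small non-negative integers are assumptions chosen for illustration; they are not the representations provided by the prototype.

```python
from typing import List


def union_bitmap_list(bits: int, items: List[int]) -> int:
    """Union of a bitmap set and a list set, returned as a bitmap.

    Bit i of `bits` is 1 when integer i is a member; `items` holds
    non-negative integers and may contain duplicates.
    """
    for i in items:
        bits |= 1 << i   # setting an already-set bit is harmless
    return bits


result = union_bitmap_list(0b0101, [1, 2])   # {0, 2} union {1, 2}
```

A routine like this avoids converting either operand first, at the price of requiring one hand-written function for each pair of representations that may meet in an expression.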

TYPESETTER has been designed to allow experimentation with various
kinds of profile data in addition to execution counts. For instance, knowing that the
attempts to add an element to a set usually have no effect because the element is
already in the set may affect which implementation is chosen for that set. It is easy
in TYPESETTER to count how many times the add-an-element function was called to
add an object already a member of the target set. It would be exceptionally difficult
to determine analytically when such a situation holds.
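
The kind of ADT-specific counter described here might look like the following Python sketch. The class ProfiledSet and its attribute names are hypothetical; the sketch shows only the idea of counting redundant adds and does not reflect how TYPESETTER itself collects profile data.

```python
class ProfiledSet:
    """Set wrapper that counts add calls and adds of elements already present."""

    def __init__(self) -> None:
        self._items = set()
        self.add_calls = 0
        self.redundant_adds = 0

    def add(self, x) -> None:
        self.add_calls += 1
        if x in self._items:
            self.redundant_adds += 1   # the add had no effect
        else:
            self._items.add(x)

    def __contains__(self, x) -> bool:
        return x in self._items

    def redundant_fraction(self) -> float:
        """Share of add calls that found the element already in the set."""
        return self.redundant_adds / self.add_calls if self.add_calls else 0.0


profiled = ProfiledSet()
for value in [1, 2, 1, 1]:
    profiled.add(value)
```

A high redundant fraction would suggest favoring a representation with a cheap membership test, since most adds reduce to a lookup.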

The traditional metric (or objective function) used for selecting one
representation over another has been the space-time product, but there are complications
in dealing with parameters of the system in which the software is to be run. For
example, it may be that representation X is better as long as the amount of memory
used does not exceed physical memory limits, otherwise using the more compact
representation Y will cause less thrashing of the virtual memory system. The
inefficiency due to a denser encoding would be offset by the improved performance
of the system as a whole. It is problematic how to specify these kinds of limits,
particularly those that depend on system parameters that are difficult to profile or
can vary dynamically and orthogonally to the actions of the program (e.g. dynamic
paging rates). Therefore, this study has not attempted to determine the ‘best’ way
of characterizing program behavior. The system is designed such that each
interface function has exactly one evaluation function, which returns a real number that
represents the relative behavior of the interface function at a particular call site.

Also, this study limits itself to implementations that do not require
automatic changes to User-defined data structures; all abstract data types implemented
in the system will be ‘exomorphic’, only pointing to objects the user is manipulating.
For example, an exomorphic implementation of a list would manipulate only
references to the objects in the list. In contrast, an endomorphic implementation might
include the links of the list as part of the User object, one (set of) link(s) for each list
to which the object might belong. An endomorphic strategy might be particularly
attractive when, for example, it is known that each object can be on only one list at
a time.
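
The distinction can be sketched in Python as follows. The classes ExoCell and Task and the helper functions are hypothetical names used only to contrast the two layouts: the exomorphic list keeps its links in separate cells that refer to the User object, while the endomorphic list stores the link inside the User object itself.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Task:
    """Hypothetical User object."""
    name: str
    next_in_queue: Optional["Task"] = None   # endomorphic link, part of the object


@dataclass
class ExoCell:
    """Exomorphic list cell: holds only a reference to the User object."""
    item: Any
    next: Optional["ExoCell"] = None


def exo_push(head: Optional[ExoCell], item: Any) -> ExoCell:
    return ExoCell(item, head)


def exo_names(head: Optional[ExoCell]) -> List[str]:
    names = []
    while head is not None:
        names.append(head.item.name)
        head = head.next
    return names


def endo_push(head: Optional[Task], task: Task) -> Task:
    task.next_in_queue = head   # overwrites the object's single link
    return task


def endo_names(head: Optional[Task]) -> List[str]:
    names = []
    while head is not None:
        names.append(head.name)
        head = head.next_in_queue
    return names


t1, t2 = Task("a"), Task("b")
list_one = exo_push(exo_push(None, t1), t2)   # t1 is on two exomorphic lists
list_two = exo_push(None, t1)
queue = endo_push(endo_push(None, t1), t2)    # one link field, so one queue per task
```

The exomorphic form lets t1 sit on two lists at once at the cost of an extra cell per membership, while the endomorphic form saves the cell but ties each object to a single list per link field.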

The  problem  of  generating  or  modifying  structures  to  take  advantage  of 
such  implementations  is  orthogonal  to  the  problem  of  using  profile  data  to  determine 
which  implementation  is  best.  Once  the  use  of  profile  data  is  shown  to  be  viable, 
then  the  same  techniques  can  be  applied  to  endomorphic  implementations. 

Programs  that  exhibit  phase  behavior  are  problematic.  A  user’s  program 
could  exhibit  phase  behavior  by  manipulating  data  one  way  early  in  the  execution 
of  the  program  (say,  during  the  initial  construction  of  an  aggregate  variable)  and 
utilizing  that  data  quite  differently  in  later  stages  of  the  program  (say,  during  access 
and  modification  of  that  aggregate  variable).  An  interesting  problem  is  the  detection 
of  the  behavior  and  the  optimal  points  for  changing  the  implementation  of  the  ADT 
from  one  representation  suitable  for  the  first  phase  into  another  representation  more 
suitable for the later phase. A general solution to this problem is beyond the scope
of our work. In our model, each static instance of a variable will have exactly
one implementation. We can approximate some of the advantages of phase behavior
detection by limiting conversion from one representation to another to the assignment
of a variable. Consider the following:

    Set(int) A;

      :   (section one)

    -- end of phase one of program

      :   (section two)


It  may  very  well  be  the  case  that  the  behavior  of  the  program  in  section 
one  demands  that  the  variable  A  be  implemented  as  a  singly-linked  list,  whereas  the 
behavior of section two would be more efficient if A were implemented as a
doubly-linked list. In our current model, A will be assigned only one implementation that will
minimize  the  cost  of  running  the  program.  If  the  program  looked  like  the  following: 

    Set(int) A;

    Set(int) B;

      :   (section one uses A)

    -- end of phase one of program

    B = A;

      :   (section two uses B)


then  we  can  look  for  the  possibility  of  converting  representations  when  B  is  assigned, 
and  releasing  resources  used  by  A.  This  would,  of  course,  require  live-dead  analysis, 
something that is currently beyond the prototype.


Another  complication  is  introduced  by  user-defined  functions.  Consider  the 
following: 
    main() {

        Set(int) A;

        Set(int) B;

        user_fcn(A, B);    -- call site 1

        user_fcn(B, A);    -- call site 2

    }

    void user_fcn(Set(int) X, Set(int) Y) {

    }

Under  our  model,  A  and  B  must  have  the  same  implementation  since  X  can  have 
only  one  implementation  (the  same  is  true  for  Y).  It  may  very  well  be  the  case 
that  the  program  would  perform  better  if  there  were  two  user  functions,  each  with 
a different type signature. However, creation of multiple copies of the user function
is beyond the capabilities of the prototype. At any rate, the problem is again
orthogonal to the problem of using profile data, so we apply the general rule stated
above  for  variables  to  the  signature  of  functions:  each  statically  declared  object  in 
the  User’s  program  will  have  exactly  one  representation  associated  with  it  during 
the  running  of  the  program. 

Real-time  applications  can  impose  severe  constraints  on  the  behavior  of 
a  program,  and  are  not  considered  further  in  this  work.  The  work  of  Kenny  and 
Lin  [27]  provides  an  important  parallel  to  our  work  in  the  area  of  real-time  control. 
In  particular,  we  discuss  later  how  their  evaluation  function  generation  techniques 
could  be  used  in  our  system  to  improve  the  precision  and  portability  of  evaluation 
functions  (see  page  104). 


4.2  Previous  work 

Typesetter is the first system to use a general technique for collecting
ADT-specific profile data, and using that data to choose implementations. Almost
all previous systems (with the exception of Low’s) attempt to synthesize data
representations: Typesetter chooses the implementation of a function to use, and
thereby indirectly selects, but does not synthesize, the representations of the
program’s variables.

Low did the original work on implementation selection using profile data
[31,32,33]. His system attempted to provide implementations for three abstractions:
sets, lists, and a ternary relation which is unique to the base language SAIL. Each
ADT had several implementations that could be used to implement the User’s
variables. Each of the functions making up the interface of an ADT was written in
assembler, and had associated with it an evaluation function that, given a frequency
of execution and an aggregate size (e.g., the number of elements in a set), would
return an estimate of the cost of using the interface function. His evaluation functions
returned an estimate of the number of machine cycles and bytes required on any one
invocation of a function.

partition the variables and expressions into equivalence classes (eq.c.);
determine which operations are performed on which eq.c.;
for each eq.c. do
    S ← all representations;
    remove from S all representations which are not feasible;
        -- may not have sufficient information at compile time
        -- may require an operation not implemented in a rep.
    predict time and space requirements for all s ∈ S
    for all s1, s2 ∈ S
        if s1 requires both more time and space than s2 then
            remove s1 from S
    endfor
    rank remaining representations in S by likelihood of being
        the best representation;   -- uses a cost fcn
    use a hill-climbing heuristic to finalize implementations

Figure 4.1: Low’s algorithm
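The pruning step of Figure 4.1 (remove any representation that needs both more time
and more space than some other one) can be sketched in C++ as follows. This is only
an illustration under stated assumptions: the Candidate record, its fields, and the
function names are hypothetical and are not taken from Low's system.

```cpp
#include <string>
#include <vector>

// Hypothetical record: one candidate representation with its predicted costs.
struct Candidate {
    std::string name;
    double time;   // predicted time requirement
    double space;  // predicted space requirement
};

// True if a requires both more time and more space than b.
inline bool dominatedBy(const Candidate& a, const Candidate& b) {
    return a.time > b.time && a.space > b.space;
}

// Keep only the candidates that no other candidate beats on both measures.
inline std::vector<Candidate> pruneDominated(const std::vector<Candidate>& s) {
    std::vector<Candidate> kept;
    for (const Candidate& a : s) {
        bool dominated = false;
        for (const Candidate& b : s) {
            if (dominatedBy(a, b)) { dominated = true; break; }
        }
        if (!dominated) kept.push_back(a);
    }
    return kept;
}
```

Because the "both more time and more space" relation is transitive, testing each
candidate against the full original set gives the same survivors as removing
candidates one at a time.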

The system required four passes over a program, with human interaction as
one of the passes. The first step ran the subject program (using default
implementations for the abstractions) with software monitoring inserted to collect a
profile of the performance of the program in terms of statement counts. The system then
prompted the user for information too difficult or impossible to derive analytically
(e.g., “What is the average size of set foo?”). A penultimate static analysis pass
computed the possible contents of variables in terms of other variables. This had
the side-effect of partitioning the variables of the program into equivalence classes;
each equivalence class identified the variables that had to have the same
implementation as all other variables in the class.

Low’s algorithm for selecting representations (Figure 4.1) uses call sites
solely for feasibility testing. Once a set of feasible assignments has been established,
an initial set of implementations is assigned. The final heuristic (Rowe called
it incremental search) continually ‘perturbs’ the assignment of implementations by
making changes that seem likely to improve the overall performance of the program
and keeping only those changes that do.

Low says that his system could not take into account certain features of set
insertion (e.g., elements are always inserted in a specific order) and that he thinks
they would be hard to include. TYPESETTER allows collecting this kind of
information (i.e., ‘the data is always added in
increasing/decreasing/non-increasing/non-decreasing/extremal order’) as well as
other information (e.g., sizes of sets, average length of lists, etc.).

Low’s system did not allow operators to work on multiple representations:
a union operator’s two operands had to have the same representation. In our
approach, a particular implementation function can be assigned to a call site if the
actual parameters at the call site can be assigned the types of the implementation
function’s formal parameters. It is up to the Implementor(s) which of these
mixed-representation functions to implement.

Typesetter explores several aspects of Low’s general technique. Low’s
ADT interface functions were written in assembler so he could make the evaluation
of an invocation of one of those functions as precise as possible. He did not want
to tackle the problem of writing evaluation functions for compiler-generated code.
Given that precision is lost in any estimation of future performance of a real program,
and that the performance of a function depends on more than just its frequency
of execution and the size of the aggregate-type object, Typesetter’s evaluation
functions accept inexactness as inevitable, and assume that programs satisfying the
90-10 rule are skewed enough to make the loss of precision irrelevant to the final
decisions.

Also, Typesetter’s interface and evaluation functions are all in a higher-level
language (C++), and the evaluation functions are in terms of this language’s
constructs. That is, whereas Low’s system required the Implementor to count cycles
in instructions in order to write an evaluation function, TYPESETTER’s evaluation
functions are written in terms of the high-level language’s constructs. TYPESETTER
has not solved the problem of providing an evaluation of compiler-generated code;
rather, it finesses the whole problem by admitting up front that evaluation is inexact.
Precision is not possible a priori with compiled high-level language code (e.g., a
different compiler’s optimizer will produce different code), but in exchange we get
portable, understandable, easily tuned code, both for the implementations of the
interface functions and for the evaluation functions.

And finally, Low’s system concentrated on finding implementations for a
program’s variables, using program structure solely to determine the feasibility of
the various implementations. TYPESETTER turns that around, and concentrates on
finding implementations for the interface functions, and lets that specify what the
implementation of the variables must be. Section 4.4.1 below explains this method
of implementation selection in detail.

The other major work relevant to TYPESETTER is Hansen’s work on adaptive
compilation, which we already discussed in some detail in Chapter 1.1. Hansen’s
work is the primary justification for this research in compiler utilization of profile
data, though Hansen’s system was targeted toward ‘one-shot’ compilation.

Other work relevant to what we are doing has concentrated primarily on
how to assign implementations analytically. That is, given some program
specification, most work has concentrated on finding means for determining
implementations solely on the basis of that specification.

Barstow’s  PECOS  system  [3,24]  processes  a  program  specification  via  a 
knowledge  database  of  program  transformation  rules.  The  program  is  declaratively, 
as  opposed  to  imperatively,  specified.  Kant’s  LIBRA  system  [24]  extended  PECOS 
and  attempted  to  apply  the  same  rule-based,  synthesizing  approach  to  performance 
prediction. The knowledge database was enhanced with rules about estimating
potential performance of partially constructed programs. Although LIBRA could allow
the  use  of  profile  data,  it  was  neither  integral  nor  essential  to  the  approach. 

Typesetter  does  not  attempt  to  synthesize  programs  analytically,  nor 
does  it  attempt  to  work  with  program  synthesis  at  as  high  a  level  as  does  PECOS. 
Rather,  my  goal  was  to  explore  the  possibility  of  providing  implementation  selection 
in  the  context  of  modern  day  compilers.  Rather  than  seek  a  Copernican  revolution 
and  invent  a  totally  new  language  in  which  to  specify  programs,  I  sought  a  more 
evolutionary  approach  to  give  existing  languages  and  systems  as  much  capability  as 
possible. 

Ramirez [37] used zero-one integer programming to assign implementations.
His approach required condensing the behavior of a program into two matrices
$s(i,j)$ and $t(i,j)$, where $s$ is the estimated storage space consumed by
implementation $j$ when used to implement variable (substructure, he calls it) $i$,
and $t$ is the corresponding time estimate. His claim that the behavior of an
implementation of an ADT can be summarized by two numbers $s(i,j)$ and $t(i,j)$ is
highly suspect. It ignores, for example, how the behavior of a function or operator
may change when provided with arguments of differing implementations. That is, he
assumes that if implementation $j$ is assigned to variable $i$, then $t(i,j)$, the
amount of time required by that assignment, is independent of any other assignments.
This is almost never the case, particularly when operators can accept operands with
differing implementations (e.g., a union of a set implemented as a bitmap with a set
implemented as a linked list). TYPESETTER moves the focus of evaluation functions
from the variable to the implementation of the interface functions. This allows the
interacting costs of assignments to be taken into account at the expense of losing the
ability to use zero-one integer programming to achieve an optimal solution.
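To make the objection concrete, the sketch below models the cost of one union call
as a table keyed by the representations of both operands. The cost figures are
invented purely for illustration, and the names Rep, unionCost, and separable are
hypothetical. A two-by-two cost table c(a,b) can be split into independent
per-variable terms t1(a) + t2(b) exactly when c(x,x) + c(y,y) = c(x,y) + c(y,x);
the check fails here, so no pair of per-variable time estimates reproduces the table.

```cpp
#include <map>
#include <string>
#include <utility>

using Rep = std::string;  // hypothetical: name of a representation

// Illustrative (made-up) cost of one union call, keyed by both operand representations.
inline double unionCost(const Rep& a, const Rep& b) {
    static const std::map<std::pair<Rep, Rep>, double> table = {
        {{"bitmap", "bitmap"}, 1.0},
        {{"bitmap", "list"},   6.0},
        {{"list",   "bitmap"}, 6.0},
        {{"list",   "list"},   9.0},
    };
    return table.at({a, b});
}

// True if the costs over {x, y} could be written as t1(a) + t2(b),
// i.e. as independent per-variable estimates.
inline bool separable(const Rep& x, const Rep& y) {
    return unionCost(x, x) + unionCost(y, y) == unionCost(x, y) + unionCost(y, x);
}
```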

Work within the SETL project [9,43] derives representations from declarations
in the language and from analysis; e.g., frequencies are estimated by an analysis
of  the  program  text.  I  know  of  no  work  using  profile  data  in  the  synthesis  of  SETL 
programs. 

The SETL optimizing compiler attempts to determine a good implementation
for the set and mapping abstractions in the language (there is only one
representation for tuples). The default representation for sets and maps uses hash
tables. If the analysis can determine bases for the elements of the sets, or if the
programmer declares elements to belong to specific bases, then other more efficient
implementations are possible for subsets of the bases. A subset can be represented
as a bit in the structures for the elements of the bases (if the bit is one, then the
element belongs to that subset, if zero, then not). If all elements of a base set are
assigned unique integers, then a subset can be implemented as a bit-map. Or a
subset might be represented with a separate hash table of pointers into the base set.
Straub’s Taliere system improves on the optimization phase of the SETL
compiler by considering estimates of performance, including symbolic analysis of
execution frequencies. However, since he does not utilize profile data, the User must
answer questions¹ of the form “What is the average size of s*t in line 215?” or even
“What is the expected number of iterations in an average execution of the loop
starting at line 1235?” Even worse examples of the kinds of dialogue the system
forces on the User are questions about probabilities: “What is the probability of the
CASE statement of line 1113 taking the alternative of line 1126?” It seems extremely
doubtful to me that a User would know this information with any precision or
confidence without profile data.

Weiss [51] worked on finding types of recursive SETL variables, and presented
methods for implementing such structures. However, he does not worry about
selection  of  ‘best’  implementations  by  numeric  criteria. 

Sherman’s  dissertation  [44]  presents  a  very  comprehensive  approach  to  the 
problem  through  language  design.  The  primary  contribution  of  his  programming 
language  Paragon  is  the  idea  that  implementations  are  subclasses,  or  refinements,  in 
the  ADT  hierarchy  (with  multiple  inheritance).  That  is  to  say,  an  implementation  is 
just  a  refinement  of  an  abstract  data  type  and  is  specified  using  the  same  notation  as 
that  used  to  specify  the  abstraction.  Paragon  is  an  ambitious  system  that  attempts 
to  solve  many  problems  at  once,  including  selection  of  a  refinement  of  an  ADT  based 
solely  on  the  program  text.  Presumably,  profile  data  could  be  used,  but  he  does 
not discuss this in any depth. In the Paragon model, the User (our terminology) is
responsible for writing the complete evaluation function (Sherman calls it the policy
procedure) that selects the implementations of the variables of the program. This
puts the onus of selection on the wrong member of our programming duo. We have
attempted to design a system that puts the onus of implementation evaluation on
the Implementor, and selection of implementations for functions and variables on
the system, not on the User.

Rowe’s system [38] approached the problem from the direction of selecting
an implementation from a description of the desired data relations and functionality.
His modeling domain language is implementation independent and is used to search
for implementations that satisfy the described relations and operations. In those
cases where there does not exist an implementation satisfying the description, Rowe
investigated ways of generating an implementation. However, he did not investigate
the use of profile data in his work.

¹ The questions are taken from Straub’s dissertation.


4.3  Typesetter:  The  System 

In the discussion of the TYPESETTER system, it is useful to distinguish
between different modes of programming which I identify by reference to two different
programmer groups: the Users and the Implementors. Since Users need not be
as facile with the theory behind TYPESETTER as Implementors need to be, the
distinction between them is not one simply of convenience of notation. Users write
programs that utilize the abstractions provided by the system; the system selects
from among the implementations installed by the Implementors.

Examples  in  the  following  sections  are  based  on  the  existing  prototype, 
described  in  greater  detail  starting  in  Section  4.4.  Language  enhancements  required 
to  support  Typesetter  are  not  extensive  and  the  notations  should  be  relatively 
transparent to anyone who has used an object-oriented programming language like
C++. 

4.3.1  Formalities 

An abstract data type (ADT) is a set of function signatures indexed by
function names. For each ADT, there is a set of representation types; we’ll write
$A \Rightarrow \mathcal{R}$ to mean that ADT $A$ can be represented by representation
$\mathcal{R}$; $\mathit{impl}(A)$ is the set of possible representations of $A$. A
function signature has the form $T_0, T_1, \ldots, T_n$, for $n > 0$ and types $T_i$.
By convention, $T_0$ is the type returned by the function. For all signatures in
ADT $A$, the $T_i$ are themselves ADTs.

For each abstract function in an ADT, there is a set of implementation
functions. Like their abstract function counterparts, implementation functions
consist of a name and a signature. But where the abstract functions’ parameter types
$T_i$ are ADTs, the implementation functions’ parameter types $T'_i$ are
representation types. Furthermore, $T_i \Rightarrow T'_i$ for all $i$.

Our task is to assign a representation type to each variable in a user’s
program, and hence an implementation function to each call site in the program.

Define a program to be a set of variables $V$ and function call sites $C$. Each
variable $v \in V$ has been declared to have one of the ADTs in $T$,
$\mathit{atype}(v)$. Each function call site $c \in C$ consists of the name of an
abstract function, $\mathit{absfcn}(c)$, and a list of actual argument variables
$\mathit{actuals}(c)$. When an implementation function is assigned as the
implementation of the abstract function at a call site, then the implementation type
of $v_j$, the $j$th actual, is assigned to be $T'_j$, the $j$th type in the
signature of the implementation function; i.e., $\mathit{itype}(v_j) = T'_j$.
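The assignment rule just described can be sketched in C++ as a feasibility check.
This is a simplified illustration and not the prototype's code: types are modelled as
plain strings, the return type is ignored, and the names ImplTable, ImplFunction,
CallSite, and assign are hypothetical. The maps atype, impl, and itype mirror the
functions of the same names in the text.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

using Type = std::string;                           // name of an ADT or representation type
using ImplTable = std::map<Type, std::set<Type>>;   // impl(A): ADT -> its representations

struct ImplFunction {
    std::string name;
    std::vector<Type> formals;          // representation types of the parameters
};

struct CallSite {
    std::string absfcn;                 // name of the abstract function
    std::vector<std::string> actuals;   // names of the actual argument variables
};

// Try to assign implementation function f to call site c.
// On success, itype records the representation chosen for each actual.
inline bool assign(const ImplFunction& f, const CallSite& c,
                   const std::map<std::string, Type>& atype,
                   const ImplTable& impl,
                   std::map<std::string, Type>& itype) {
    if (f.formals.size() != c.actuals.size()) return false;
    std::map<std::string, Type> trial = itype;      // commit only if every actual fits
    for (std::size_t j = 0; j < c.actuals.size(); ++j) {
        const std::string& v = c.actuals[j];
        const Type& rep = f.formals[j];
        auto a = atype.find(v);
        if (a == atype.end()) return false;         // undeclared variable
        auto reps = impl.find(a->second);
        if (reps == impl.end() || reps->second.count(rep) == 0)
            return false;                           // rep cannot represent the variable's ADT
        auto cur = trial.find(v);
        if (cur != trial.end() && cur->second != rep)
            return false;                           // one representation per variable
        trial[v] = rep;
    }
    itype = trial;
    return true;
}
```

With two Set variables A and B, assigning a mixed bit-vector/linked-list function to
a call on (A,B) succeeds, after which the same function is rejected at a call on
(B,A), since each variable may have only one representation.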
4.3.2  The ideal system

The  prototype  system  is  a  subset  of  a  larger  vision.  I  have  argued  previously 
that  profile  data  should  be  collected  all  the  time,  and  that  the  overhead  for  doing  so 
is  quite  minimal.  Then  the  programming  system  would  allow  the  user  to  develop  a 
debugged  and  efficient  program  in  the  following  steps. 

During initial implementation and debugging, the User rarely needs to decide
the implementation of many, if not most, of the abstractions used in the program.
The  Typesetter  system  will  utilize  the  simple  program  execution  counts  described 
in  chapter  2  to  pick  a  reasonable  implementation  for  the  abstractions. 

As the User’s system is implemented, and its structure is tested with more
complex, more complete, and perhaps larger sets of input data, the actual
implementation of the abstractions becomes more important (if for no other reason
than that debugging a slow program can be particularly aggravating). By this time,
the User will have gained enough experience with the program that he can conjecture
which implementations of the abstractions may be reasonably efficient for the
program. This conjecturing is important only to the extent that it allows the User to
determine what additional information might be helpful to supply about the
abstractions: Typesetter will determine which implementations are actually best.
This additional information is supplied as part of the declarations of the variables,
and is discussed in more detail in section 4.3.5.

It is also true that simple execution counts are insufficient for selecting an
implementation. Alternative implementations of abstractions are created by
programmers to take advantage of the interaction of properties of the abstractions with
specific properties of sets of input data. Therefore, to determine that a bit-mapped
implementation of a set is preferred over a linked-list implementation requires
knowing not only how the program makes use of the data (e.g., number of insertions vs.
number of deletions; the mix of element access and destructive operations; etc.) but
also requires some information about the input data itself (e.g., is it read in
increasing/decreasing order; is it ‘sparse’; is it locally dense; etc.). This information
can only be gathered directly and intentionally, and cannot be inferred from execution
counts except at great expense, if at all.
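A profiling implementation of this kind might look like the following C++ sketch,
which behaves as an ordinary set of integers while also recording whether elements
arrive in increasing or decreasing order and how large the set grows. The class and
its members are hypothetical and are not taken from the prototype.

```cpp
#include <cstddef>
#include <set>

// Sketch of a profiling Set(int): a working set that also records
// abstraction-specific facts about how it is used.
class ProfilingIntSet {
public:
    void add(int x) {
        if (adds_ > 0) {
            if (x <= last_) increasing_ = false;   // not strictly increasing
            if (x >= last_) decreasing_ = false;   // not strictly decreasing
        }
        last_ = x;
        ++adds_;
        elems_.insert(x);
        if (elems_.size() > maxSize_) maxSize_ = elems_.size();
    }
    bool in(int x) const { return elems_.count(x) != 0; }

    // Profile data gathered so far.
    std::size_t adds() const { return adds_; }
    std::size_t maxSize() const { return maxSize_; }
    bool addedInIncreasingOrder() const { return increasing_; }
    bool addedInDecreasingOrder() const { return decreasing_; }

private:
    std::set<int> elems_;
    std::size_t adds_ = 0;
    std::size_t maxSize_ = 0;
    int last_ = 0;
    bool increasing_ = true;
    bool decreasing_ = true;
};
```

A fact such as "elements were always added in increasing order" is cheap to record
inside the abstraction but cannot be recovered from statement or arc counts.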

At this point in the development process, the abstractions in the program
are assigned profiling implementations, and a couple of runs of the instrumented
program (over whatever data it can handle at this stage of development) will provide
further data upon which TYPESETTER can assign more efficient implementations.
Once implementations are determined based on this data, development and
debugging of the program can proceed using this more appropriate implementation of
its abstractions.

The Typesetter prototype described here has concentrated on implementing
only the intermediate step: using profiling implementations of abstractions
to collect abstraction-specific profile data that TYPESETTER can use to select from
among a set of implementations. Lacking any profile information, the prototype
always links in the profiling implementations for all abstractions; running the program
then generates profile data for selecting more efficient implementations on a future
run of the system.

4.3.3  ADTs 

Among the ADTs available to programmers in TYPESETTER there are the
usual built-in data types (e.g., integer, real, structures, arrays, pointers, etc.), each
of which has a fixed implementation in User’s programs, and a library of more complex
abstract data types (ADTs) with implementations of functions that make up their
interface tailored to specific representations. Users, however, make no reference to
specific implementations of the functions or variables; they simply make use of the
publicly declared interface (or protocol) of the ADT. The compiler will then choose
implementations for the variables and functions to minimize a cost function based
on data collected by a profiling version of the ADT. TYPESETTER supplies three
ADTs: sets, lists, and maps. (These correspond closely to the SETL data types of
finite sets, tuples, and maps [43].) Their definitions below are in TYPESETTER’s
C++ dialect; in particular the first argument to a function is understood to be a
pointer to the object by which the function is invoked.

Sets  The Set abstract data type is generic in the type of the contained objects,
which is denoted as Any. Figure ?? contains the definition of the Set ADT, and
Figure 4.2 lists some possible implementations of exomorphic sets. Those with
asterisks are currently implemented in the prototype. There are many possible
representations of Sets, a few of which are briefly described in Figure 4.2. (The reader
may wish to compare this list with the implementations provided by the SETL compiler;
see pg. 57.)

Lists  The  List  abstraction  is  generic  in  the  type  of  the  contained  objects.  Figure  4.3 
lists  the  functions  comprising  the  interface  to  the  ADT.  There  are  many  possible 
representations  of  Lists,  a  few  of  which  are  described  briefly  in  Figure  4.4. 

Maps  Maps, or finite functions, are generic in the type of the domain element and
the type of the range element. Figure 4.5 lists the functions forming the interface
to the ADT, and Figure 4.6 lists a few of the many possible implementations of Maps.
*Linked list.

Doubly-linked list.

*Sorted linked list: A linked list on which the elements are maintained in sorted
order.

*Bit vector: Elements must be, or map to, integers. Requires knowing max and
min integers. Three variations: nofElements kept as part of the set, fast array
lookup element count for bytes, ditto for words.

Hash table: Useful when the key is not an integer, but an arbitrary collection of
bits. Information about the range of the hash function and the density of the
resulting hash values would help select good parameters for the hash table. For
sets of arbitrary objects, the programmer must supply a hash function.

Sorted array: Keeps a sorted list of the actual elements of a set. Requires knowing
max and min elements; knowing the maximum size of a set, and the average
size of sets may help select better parameters for the implementation.

Sorted variable length array: Ditto. Requires extra overhead for the dope vector.

Linked array: Requires knowing max and min elements, as well as the fact that the
elements tend to cluster. Optimizes space at the expense of time. To be used
in environments where reallocating sets due to growth or memory compaction
may be more expensive than just chasing pointers.


Figure 4.2: Possible implementations of sets (* in prototype)
List(): The constructor of List.

boolean empty(): Returns true if the list is empty.

void makeEmpty(): The list is emptied.

boolean in(Any elt): Returns true if elt is in the list.

int cardinality(): Returns the size of the list.

void rest(List L): Removes the first element of the list.

Any first(List L): Returns the first element on the list.

append(Any e): The element e is appended to the list.

prepend(Any e): The element is pushed onto the front of the list.

delete(Any e): All instances of the element e are removed from the list.

sort(CmpFcn f(Any,Any)): The list is sorted in place using the comparison
function f.

iterInit(Iterator i): Initialize an iterator over the list.

iterate(Iterator i, Any &elt): Assign elt the next element of the list in the
iteration and return true, else return false.

iterDone(Iterator i): Return true if an invocation of iterate would return
false.

iterCleanup(Iterator i): Return resources allocated to the iterator.

iterCopy(Iterator i, Iterator &j): The iterator i is copied into a new iterator j.


Figure 4.3: Specification of List ADT
*Linked list.

Doubly-linked list.

Fixed-length array: Each list has an array of maximum size allocated for it.

Linked array: The list is kept in a list of arrays, each sub-array
allocated/deallocated as the list is manipulated. Requires knowing max and min
elements, as well as the fact that the elements tend to cluster. Optimizes space
at the expense of time. To be used in environments where reallocating lists
due to growth or memory compaction may be more expensive than just chasing
pointers.


Figure 4.4: Possible implementations of lists (* in prototype)
Map(): The constructor of Map.

boolean empty(): Returns true if the map is empty.

void makeEmpty(): The map is emptied.

boolean inDomain(Any elt): Returns true if elt is in the domain of the map.

boolean inRange(Any elt): Returns true if elt is in the range of the map.

int cardinality(): Returns the size of the map (the number of elements defined
in the range).

define(Any d, Any r): The element d is added to the domain of the map so that
it returns the element r.

delete(Any e): All instances of the element e in the domain are removed from the
map.

sort(CmpFcn f(Any,Any)): The map is sorted in place using the comparison
function f.

iterInit(Iterator i): Initialize an iterator over the map.

iterate(Iterator i, Any &d, Any &r): Assign d the next element in the domain
of the map, r the corresponding element in the range, and return true;
else return false.

iterDone(Iterator i): Return true if an invocation of iterate would return
false.

iterCleanup(Iterator i): Return resources allocated to the iterator.

iterCopy(Iterator i, Iterator &j): The iterator i is copied into a new iterator j.


Figure 4.5: Specification of Map ADT
*Linked map: The map may be singly or doubly linked, and consists of pairs of
elements as related by calls to define.

Fixed-length array: Each map has an array of maximum size allocated for it, if a
maximum size is known. The array is two dimensional, one each for the domain
and range.

Linked array: The map is kept in a map of arrays, each sub-array
allocated/deallocated as the map is manipulated.

Hash table: Hashed by domain elements for faster lookup.

Binary tree: So seeks on domain elements are O(log n), for n the number of
elements in the domain.


Figure 4.6: Possible implementations of maps (* in prototype)
4.3.4  Iterators

In a complete language definition, much of the functional interface for iteration
would be hidden from the Users with some syntactic sugar. I envision something
close  to  the  Alphard  paradigm  for  iterators,  and  the  User  would  write  something  like 
the  following  to  have  the  compiler  invoke  the  appropriate  iteration  functions. 

Set(SomeType) S;

for i in S do
    ... -- use of element i
endfor;

The  above  code  would  be  translated  into  something  like  the  following  using  the 
Typesetter  iterator  paradigm: 

Set(SomeType) S;
Iterator S_iter;

S.iterInit(S_iter);
while S.iterate(S_iter,i) do
    ... -- use of element i
endwhile;
S.iterCleanup(S_iter);

The  iterators  are  associated  with  the  object  being  iterated.  Their  exact  form  is  never 
available at the User level and, therefore, the ADT implementation is free to create
the  iterator  object  necessary  to  successfully  traverse  the  aggregate  type.  In  other 
words,  every  ADT  exports  the  Iterator  type,  but  not  the  internals  of  the  Iterator 
type. 
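In standard C++ the same paradigm might be sketched as below. This is an illustration
only: IntSet and sumOf are hypothetical names, the iterator is passed by reference
rather than by value as in the figures above, and an array-backed representation is
assumed. The nested Iterator type is visible to callers, but its fields are private
to the ADT.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a Set(int) that exports an Iterator type but hides its internals.
class IntSet {
public:
    class Iterator {
        friend class IntSet;      // only the ADT may touch the position
        std::size_t pos = 0;
    };

    void add(int x) {
        for (int e : elems_) if (e == x) return;   // keep elements unique
        elems_.push_back(x);
    }

    void iterInit(Iterator& it) const { it.pos = 0; }

    // Assign elt the next element and return true, else return false.
    bool iterate(Iterator& it, int& elt) const {
        if (it.pos >= elems_.size()) return false;
        elt = elems_[it.pos++];
        return true;
    }

    bool iterDone(const Iterator& it) const { return it.pos >= elems_.size(); }

    void iterCleanup(Iterator&) const {}   // nothing to release in this representation

private:
    std::vector<int> elems_;
};

// What "for i in S do ... endfor" might expand to.
inline int sumOf(const IntSet& S) {
    IntSet::Iterator S_iter;
    int i = 0, total = 0;
    S.iterInit(S_iter);
    while (S.iterate(S_iter, i)) total += i;
    S.iterCleanup(S_iter);
    return total;
}
```

A linked-list or bit-vector representation could keep a node pointer or a bit index
in its Iterator instead, without any change to code written against this interface.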

4.3.5  Optional  parameters 

Users  should  be  able  to  write  simple  declarations  of  their  program  variables 
and  have  the  system  select  an  appropriate  implementation  of  those  variables  based 
on  that  declaration  and  on  knowledge  of  the  behavior  of  the  program  containing 
those  declarations. 

Set(Bar) foo;

declares that the variable foo contains a set of objects of type Bar. TYPESETTER
recognizes several objects in this declaration. The first is, of course, the use of foo,
which is the name of the variable being declared, and the second is the name of
the ADT. The remaining parameters supply information to the ADT itself, and are
of two types: those required for even a minimal implementation of the ADT, and
optional parameters which supply further information that may enable (or exclude)
implementations for consideration. Required parameters are positional, and optional
parameters are named. In our example, Bar is the name of a User type and is
required by all implementations of Sets. A declaration of a variable of type Map has
two required parameters, the type of the key, and the type of the data. For example,

Map(T1,T2) vmap;

declares vmap to be a mapping from key objects of type T1 to data objects of type
T2.

Use  of  the  variables  follows  standard  object-oriented  form.  For  example, 
iset.add(3);

adds  the  element  ‘3’  to  the  set  named  iset. 

Optional parameters (or just optionals) are not required for an implementation
to be assigned to a variable: there is always at least one implementation of an
ADT that can be assigned to any variable of that type. Optionals supply information
that allows the system to consider other, possibly more efficient, implementations for
a variable. For instance, consider:

Set(int) iset;

As  declared,  the  variable  iset  could  be  implemented  with  one  of  a  variety  of  bit-map 
implementations,  but  with  only  those  implementations  that  can  handle  bitmaps  of 
unknown  and  possibly  varying  size,  and  perhaps  even  negative  values  as  elements. 
This  implies  a  relatively  complex  implementation  of  bitmaps.  If  the  User  were  aware 
that  the  only  correct  integer  values  for  this  set  were  positive,  that  information  could 
be  provided  with  the  optional  parameter  lowerb: 

Set(int, lowerb=0) iset;

The  optional  parameters  are  named  parameters.  The  additional  information  they 
provide  could  enable  the  consideration  of  other  bit-mapped  implementations  that 
need  it.  Obviously,  the  more  information  provided  about  the  user’s  objects,  the 
more  implementations  that  can  be  considered. 
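
To illustrate why bounds matter, the following is a minimal C++ sketch of a bit-mapped set that is only usable when the range of elements is known. It assumes that both lowerb and upperb have been supplied; the class name BoundedBitSet and its members are hypothetical and are not taken from the prototype.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical bit-mapped set of ints for a known range [lowerb, upperb].
class BoundedBitSet {
    static const int kBits = sizeof(unsigned) * CHAR_BIT;  // bits per word
    int lowerb_, upperb_;
    std::vector<unsigned> words_;
public:
    BoundedBitSet(int lowerb, int upperb) : lowerb_(lowerb), upperb_(upperb) {
        assert(lowerb <= upperb);
        words_.assign(static_cast<std::size_t>(upperb - lowerb) / kBits + 1, 0u);
    }

    void add(int e) {
        assert(lowerb_ <= e && e <= upperb_);   // the User's contract
        int i = e - lowerb_;                    // shift into 0-based bit index
        words_[i / kBits] |= 1u << (i % kBits);
    }

    bool contains(int e) const {
        if (e < lowerb_ || e > upperb_) return false;
        int i = e - lowerb_;
        return ((words_[i / kBits] >> (i % kBits)) & 1u) != 0;
    }
};
```

Without the bounds, the same representation would have to grow on demand and cope with negative elements, which is the "relatively complex" case described above.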

Consider  the  following  declaration: 

Set(Utype) uset;

The  User  has  declared  a  set  of  objects  of  type  Utype,  a  user-declared  type.  In  this 
situation,  bit-mapped  implementations  are  not  at  all  feasible  since  the  compiler  has 
no  way  of  mapping  objects  of  type  Utype  onto  integers.  If  such  a  mapping  is  possible, 
the  User  may  declare  the  mapping  function  and  its  inverse: 

Set(Utype, objToInt=f, intToObj=g) uset;

The system is now free to consider bit-mapped implementations as before.

In an ideal system, the User might be given hints as to which optionals
might provide performance improvements. How such hints might be automatically
generated is a topic for further research. In the TYPESETTER system, the
Implementor is responsible for providing the documentation for the optionals supported
by an implementation. That documentation will include a general description of the
possible effects on TYPESETTER’s ultimate choices of implementations. That is, we
depend on documentation to help Users decide which optionals might be beneficial
for their programs.

The actual number of optionals required for the prototype has been few.
Figure 4.7 describes optionals envisioned as useful. Asterisks mark the optionals
actually implemented in the prototype. The optionals in the figure fall into one of
two categories: those that provide information that would be difficult to derive from
the program even with profile data, and those that are a convenience for the
current implementation but could be eliminated with appropriate programming by the
Implementor. For instance, the upperb optional for set implementations cannot in
general be derived from program source and profile data.

The declaration optional addedDecreasing could be detected at run time and
encoded in the profile data. If different input data were fed to an implementation
that attempted a more efficient representation by assuming the elements were added
in order, more than likely the User’s program would run slower, but would not
fail. Of course, it is possible to design an implementation whose correct operation
depended on the assumption, in which case detection by a profiling implementation
would not be sufficient: a contract with the user in the form of a declaration would
be required. The ‘order’ of the objects in this example is an internal ordering; if the
implementation depends on a User-defined ordering, then another optional declaring
the order function would need to be defined by the Implementor and declared by the
User. Specifically, the implementation Set_slistord takes advantage of the fact that
sometimes a program creates all the objects that are in a set, and adds them in the
order they are created. This often results in the heap allocator allocating the objects
such that their memory addresses correlate with their time of creation. Set_slistord
attempts to cut down on lookup time of elements in a set by keeping the elements
on a list in increasing order of the addresses of the objects.
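
The idea of an address-ordered list can be sketched in C++ as follows. This is an illustration only, not the Set_slistord source: the names Obj, AddrOrderedSet, and the probes out-parameter are hypothetical, and std::less is used because it gives a total order on pointers.

```cpp
#include <cassert>
#include <functional>

struct Obj { int payload; };   // stand-in for a User object

// Hypothetical sketch: links kept in increasing order of object address.
class AddrOrderedSet {
    struct Link { const Obj* data; Link* next; };
    Link* first_;
    static bool below(const Obj* a, const Obj* b) {
        return std::less<const Obj*>()(a, b);  // total order on pointers
    }
public:
    AddrOrderedSet() : first_(nullptr) {}
    AddrOrderedSet(const AddrOrderedSet&) = delete;
    AddrOrderedSet& operator=(const AddrOrderedSet&) = delete;
    ~AddrOrderedSet() {
        while (first_) { Link* n = first_->next; delete first_; first_ = n; }
    }

    // Returns true if e was inserted, false if it was already a member.
    bool add(const Obj* e) {
        Link** pp = &first_;
        while (*pp && below((*pp)->data, e)) pp = &(*pp)->next;
        if (*pp && (*pp)->data == e) return false;
        Link* n = new Link;
        n->data = e;
        n->next = *pp;
        *pp = n;
        return true;
    }

    // The search stops at the first link whose address is not below e,
    // so an absent element need not cost a full scan. `probes` (optional)
    // receives the number of links skipped.
    bool contains(const Obj* e, int* probes = nullptr) const {
        int n = 0;
        const Link* p = first_;
        while (p && below(p->data, e)) { p = p->next; ++n; }
        if (probes) *probes = n;
        return p && p->data == e;
    }
};
```

When objects arrive in decreasing address order, every insertion in this sketch lands at the head of the list; when they arrive in arbitrary order the structure still works, only more slowly, which matches the "slower, but would not fail" behavior described above.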

4.3.6  Alternative  implementations 

The  User  sees  a  system  complete  with  a  set  of  possible  implementations  of 
abstract  data  types.  These  implementations  were  provided  with  the  system  and  are, 
conceptually  at  least,  part  of  the  system.  It  must  be  possible  for  Implementors  to 
specify  under  what  conditions  their  implementations  can  be  selected,  what  profile 
data  needs  to  be  collected,  and  how  that  data  is  to  be  evaluated. 

One of the first responsibilities of the first Implementor of an abstraction is

Optional           value     description

Sets
* IntToObj         function  the function that converts integers into objects of
                             the correct type for this set
* ObjToInt         function  converts objects into integers
* lowerb           integer   the lower-bound value of the elements of a set of
                             integers
* upperb           integer   the upper-bound value of the elements of a set of
                             integers
  nofElts          integer   the number of base elements of the set; must equal
                             upperb-lowerb+1, if they are specified
* ObjsAreInts      (none)    declares that the base objects of this set are in
                             fact integral, and the compiler will perform the
                             correct coercions: this is a convenience optional
                             for the prototype
  compareFcn       function  accepts pointers to two objects and returns -1, 0,
                             or 1 depending on whether the first is less than,
                             equal to, or greater than the second.
  addedDecreasing  (none)    elements are added in decreasing order
  addedIncreasing  (none)    elements are added in increasing order

Lists
  maxLength        integer   User contracts that list will never be longer than
                             this

Maps
  IntToObj         function  the function that converts integers into objects of
                             the correct type for this set
  ObjToInt         function  converts objects into integers
  lowerb           integer   the lower-bound value of the elements of a set of
                             integers
  upperb           integer   the upper-bound value of the elements of a set of
                             integers
  hashFcn          function  returns a 32-bit integer that can be used in a hash
                             table implementation of a map
  compareFcn       function  accepts pointers to two objects and returns -1, 0,
                             or 1 depending on whether the first is less than,
                             equal to, or greater than the second.

Figure 4.7: TYPESETTER optionals.

function Set_P::add(Set this, any e)
{
    profiler pcnt, psizeSum, pwasIn;
    Link lp;

    pcnt++;
    psizeSum += length;

    lp = this->first;
    while (lp != nil && e != lp->data) {
        lp = lp->next;
    }
    if (lp == nil) {
        // e not in the set
        Link newp = new Link;
        newp->data = e;
        newp->next = first;
        first = newp;
    }
    else {
        pwasIn++;
    }
}


Figure 4.8: Profiling implementation of add


to define the functionality of the abstract data type and provide its first
implementation. This first implementation must also be the profiling implementation for this
ADT,  and  it  must  be  the  most  general:  it  is  a  requirement  in  TYPESETTER  that  no 
ADT  is  defined  with  functions  in  the  interface  that  cannot  be  implemented  in  the 
profiling  implementation;  otherwise,  there  would  be  no  way  to  collect  profile  data 
about  that  function. 

Figure 4.8 shows the code for the profiling implementation of the add-an-element
function in the interface for sets. This implementation, called Set_P, uses
a  very  general  structure  (in  this  case  a  linked  list)  to  ensure  that  any  function  in 
the  interface  can  somehow  be  implemented  and  profiled.  (The  converse  is  not  true: 
an  alternative  implementation  does  not  have  to  implement  every  function  in  the 
interface.  Any  program  using  that  function  could  not  have  that  implementation 
assigned  to  the  involved  variables,  however.) 

Profile  variables  (declared  as  profilers  in  Figure  4.8)  are  allocated  per  call 
site  in  the  User’s  program.  That  is,  if  the  User’s  program  calls  add  from  three  distinct 




function Set_bm::add(Set_bm this, any e)
{
    int i = (*this->objToInt)(e);

    int w = (i / (sizeof(integer)*sizeof(byte)));
    int b = (i mod (sizeof(integer)*sizeof(byte)));

    this->setbits[w] |= (1 << b);
}


Figure  4.9:  An  alternative  implementation  of  add 

sites, then a total of three instances each of pcnt, psizeSum, and pwasIn are allocated.
On each call of the add function, the invocation counter pcnt is incremented, and
the psizeSum profile variable is incremented by the current length of the set. From
this information, evaluation functions can compute the average size of the set per
call per call site. Finally, it may be useful to an implementation to know how many
times add was invoked to add an element that was already a member: the profiling
variable pwasIn allows us to compute that statistic. As other implementations are
added to the collection of implementations for sets, more information may need to be
collected by the profiling implementation. The Implementor of an implementation
that requires new profile data will modify the profiling implementation to collect it.
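
A minimal C++ sketch of per-call-site profile variables follows. It is an assumption-laden illustration rather than the prototype's mechanism: here the caller passes its own AddProfile record explicitly, whereas in TYPESETTER the allocation per call site is done by the system. The names AddProfile, ProfiledSet, avgSize, and dupFraction are hypothetical; pcnt, psizeSum, and pwasIn follow Figure 4.8.

```cpp
#include <algorithm>
#include <cassert>
#include <list>

// Hypothetical profile record, one instance per call site of add.
struct AddProfile {
    long pcnt;       // number of calls from this site
    long psizeSum;   // sum of the set's length at each call
    long pwasIn;     // calls whose element was already a member

    AddProfile() : pcnt(0), psizeSum(0), pwasIn(0) {}

    // Average size of the set per call at this site.
    double avgSize() const { return pcnt ? double(psizeSum) / pcnt : 0.0; }
    // Fraction of calls that added an existing member.
    double dupFraction() const { return pcnt ? double(pwasIn) / pcnt : 0.0; }
};

// Profiling implementation of a set of ints, using a general linked list.
class ProfiledSet {
    std::list<int> elts_;
public:
    void add(int e, AddProfile& site) {
        ++site.pcnt;
        site.psizeSum += static_cast<long>(elts_.size());
        if (std::find(elts_.begin(), elts_.end(), e) == elts_.end())
            elts_.push_front(e);
        else
            ++site.pwasIn;
    }
};
```

Two call sites that add to the same set keep separate records, so the average size seen at a site inside a hot loop is not diluted by a site that runs once at start-up.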

4.3.7  Feasibility  functions 

Continuing with the example of the add-an-element function, Figure 4.9
shows its implementation when sets are implemented as a bit map. Before a User’s
variable can be assigned Set_bm as its implementation, TYPESETTER must first check
that this is a feasible assignment. Therefore, the Implementor of Set_bm must provide
a feasibility function that TYPESETTER can call to check feasibility. The function
returns either true or false; the feasibility function for a bit-map implementation
having 32 elements or fewer is shown in Figure 4.10.
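
As a rough C++ illustration of what such a check involves, the sketch below tests whether the declared bounds of a set fit in one 32-bit word. It is a simplification made for this example and not the prototype's feasibility function: the record types IntOptional and SetDecl, and the function name feasibleOneWordBitmap, are hypothetical.

```cpp
#include <cassert>

// Hypothetical record of one integer-valued optional from a declaration.
struct IntOptional {
    bool def;    // was the optional written in the declaration?
    int  ival;   // its value, meaningful only when def is true
};

// Hypothetical summary of the optionals attached to one Set variable.
struct SetDecl {
    IntOptional lowerb;
    IntOptional upperb;
    bool objsAreInts;
};

// True when a one-word (32-bit) bit map can represent the variable:
// the elements are integers, both bounds are declared, and the range
// covers at most 32 distinct values.
bool feasibleOneWordBitmap(const SetDecl& u) {
    if (!u.objsAreInts) return false;
    if (!u.lowerb.def || !u.upperb.def) return false;
    if (u.lowerb.ival > u.upperb.ival) return false;
    return static_cast<long long>(u.upperb.ival) - u.lowerb.ival < 32;
}
```

A declaration that omits either bound is simply reported as infeasible for this implementation, and the selection process moves on to the next candidate.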

4.3.8  Evaluation  functions 

Traditional  profiling  techniques  cannot  capture  the  wealth  of  detail  required 
for  intelligent  selection  of  implementations.  For  instance,  from  knowing  the  number 
of  times  allocation  and  deallocation  routines  are  executed,  it  is  extremely  difficult  to 
deduce  the  average  size  of  sets,  say,  at  any  particular  call  site  in  a  program.  With  our 
profiling  schema,  it  is  particularly  easy.  Furthermore,  rather  complex  information 
can be acquired such as “Is this a sparse set?”, “Does this list ever have items deleted
from it?”, “Are the elements of this set entered in any particular order that yields
advantage to any implementation?”, etc.

FEASIBILITY
{
    if (!u->IntToObj->def && !u->ObjToInt->def
        && u->ObjsAreInts->def
        && u->upperb->def && u->lowerb->def
        && u->upperb->ivalid && u->upperb->ival > 0
        && u->lowerb->ivalid && u->lowerb->ival > 0
        && u->lowerb->ival < u->upperb->ival
        && (u->upperb->ival - u->lowerb->ival) < 32
       ) return true;
    else return false;
}


Figure 4.10: Feasibility function for implementation Set_bm


Eval Set_bm::add(CallSite c)
{
    return c.pcnt *
        (idividePwr2_op + modPwr2_op + orAssign_op + array_op + shift_op);
}


Figure  4.11:  Evaluation  function  for  add 


Typesetter determines which implementation of an ADT is best based
on the estimates returned by the evaluation functions supplied with each
implementation. For each function in the interface of an ADT, the Implementor must supply
an Eval function. For example, the evaluation function for Set_bm::add is in
Figure 4.11. When called with a call site as a parameter, these functions return an
estimate of the runtime resources required by this implementation.

The variable pcnt, declared in the profiling implementation as a profile
variable, is used here to estimate how much time this implementation of add would
take at a particular call site. The other profiler variables are not used for evaluating
this implementation of add. The variables in Figure 4.11 ending in ‘_op’ are constants
that estimate the relative execution times of each of the indicated high-level language
operations. These times will in general be approximate, if only because the execution
time of any construct is context-dependent. However, the purpose of these constants
is merely to provide an estimate of the time required by this function relative to other
functions implementing the same functionality for different representations. It is my
assertion that such an evaluation technique is ‘close enough’ to allow reasonably
‘correct’ assignment of implementations. It pays for itself by allowing the library of
implementations to be ported to a new system by simply changing the values of the
indicated constants, if necessary.
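
The way operation constants and profile data combine into a relative estimate can be sketched in C++. The bit-map estimate mirrors Figure 4.11; the linked-list estimate, the numeric values of the constants, and the names cmp_op and alloc_op are assumptions introduced only so that two candidates can be compared.

```cpp
#include <cassert>

// Profile data gathered for one call site of add (names follow Figure 4.8).
struct CallSite { long pcnt; long psizeSum; long pwasIn; };

// Hypothetical relative operation costs; porting means retuning these.
namespace cost {
    const double idividePwr2_op = 1, modPwr2_op = 1, orAssign_op = 1,
                 array_op = 1, shift_op = 1,
                 assign_op = 1, cmp_op = 1, deref_op = 1, alloc_op = 10;
}

// Bit map: constant work per call, as in Figure 4.11.
double evalBitmapAdd(const CallSite& c) {
    using namespace cost;
    return c.pcnt * (idividePwr2_op + modPwr2_op + orAssign_op
                     + array_op + shift_op);
}

// Linked list (assumed model): scan the whole list on every call,
// then allocate a link for each element that was not already present.
double evalListAdd(const CallSite& c) {
    using namespace cost;
    return c.pcnt * assign_op
         + c.psizeSum * (cmp_op + deref_op + assign_op)
         + (c.pcnt - c.pwasIn) * alloc_op;
}
```

With 100 calls, a summed set length of 5000, and 10 duplicate insertions, these assumed constants give 500 units for the bit map against 16000 for the list. Only the ratio matters to the selection, which is why approximate constants suffice.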

Therefore,  all  of  the  examples  here,  and  all  of  the  implementations  in  the 
prototype,  estimate  the  amount  of  time  used  in  a  direct  fashion  (assuming  infinite 
physical  memory,  etc.),  and  do  not  attempt  to  handle  the  complexities  of  more 
precise  estimates  of  performance. 

Not all functions are as straightforward as the add example above.
Consider the function for computing the union of two sets. The TYPESETTER code for
the Set_slist implementation of the two-operand union function (i.e., the union of the
two sets is assigned to one of the sets) is shown in Figure 4.12. This implementation
of union determines for each element in the set sB if it is in the current set (the
this set, in C++ terminology). If not, it is added to the current set. The evaluation
function for this function must therefore have access to the evaluation function for
the in function, passing to it the information it needs to create an estimate of its
behavior at the call site in the union function. The profiling implementation may
not have invoked the in function (and in the prototype’s profiling implementation for
sets, it doesn’t) so there is in general no profile data specific to call sites contained
in the alternative (i.e., non-profiling) implementations of an abstraction’s interface
functions. Therefore, in this case, the union evaluation function must provide
estimates for the profiling data, which it passes as parameters to the evaluation function
for in.


4.4  Typesetter:  The  Implementation 

I use the name TYPESETTER to refer to the whole system and to the
language that results from the enhancements made to C++. The actual implementation
of Typesetter consists of several parts, some of which are written in
TYPESETTER, the language. The extensions to C++ have been discussed already, and are
straightforward. There is the addition of evaluation and feasibility functions to the
declaration of class member functions (see pg. 71ff), the use of profiling variables (see
pg. 70ff), and the User’s ability to optionally declare extra information in a variable
declaration (see pg. 67ff).

All of these language features currently take the form of macro invocations.
A macro processor first scans all source files and extracts the relevant information
via the macro invocations. This information is then fed to the analysis program
which makes the actual implementation decisions. The macros are written in m5
[40], a powerful macro language designed for the manipulation of name-scoped text,
such as is found in programming language text. In addition to the alternative
implementations, the program that makes the actual implementation decisions is also
written in TYPESETTER: I call this program Therblig after Frank Gilbreth’s
qualitative unit of work-motion [14]. Figure 4.13 shows the steps necessary to compile

void
FUNCTION(unionl)(Set_slist sB)
{
    Set_slist_Link *lp = sB.first;
    while (lp != NULL) {
        if (!Set_in(lp->data)) prepend(lp->data);
        lp = lp->next;
    }
}

EVALSUB(Pcnt, PszA, PszB, Povrlp)
{
    @@ executed Pcnt times;
    @@ there were Povrlp elements of s2 already in s1;
    @@ each time loop exec'd avgBsz times;
    @@ and prepend exec'd avgNotIn times;
    if (Pcnt == 0) return 0;
    double avgBsz = PszB / Pcnt;
    double avgIn = Povrlp / Pcnt;
    double avgNotIn = avgBsz - avgIn;
    return Pcnt *
        (assign_op   // startup
         + (avgBsz * (cmpZero_op + deref_op + assign_op + not_op))
         + EVALSUBfor(in)(avgBsz, PszA, avgNotIn)
         + (avgNotIn * EVALSUBfor(prepend)()));
}

EVALUATE
{
    return EVALSUB(p_cnt, p_szA, p_szB, p_ovrlp);
}

END_FUNCTION(unionl)


Figure 4.12: Set union using linked lists

a program written in TYPESETTER into C++. The first invocation of m5 collects
information about the declaration of variables and the call sites of abstract functions,
and emits a description of the User’s program. Therblig analyzes this description,
along with profile data about the program (if it exists), and emits an assignment
of implementations for all variables and call sites. These assignments, along with a
library of ADT implementation sources and the original source of the User program,
are processed by another run of m5 to transform the User program into C++.

Therblig contains all of the functions necessary to evaluate the use of
ADT implementations in the User’s program, including the evaluation and feasibility
functions provided by the Implementors for their contributions to the library of
implementations. As can be seen in Figure 4.14, these are compiled as part of
Therblig after processing of the implementations declared for each of the ADTs
that are part of the system.

These steps are discussed in more detail in the following sections. First, I
will discuss the algorithm used in THERBLIG to assign implementations to variables.
Next, I will present the mechanism used in TYPESETTER for implementing code
sharing among the implementations assigned to the User’s variables.


4.4.1  The  Implementation  Selection  Algorithm 

In the concluding chapters of his dissertation, Low [31] observed that his
hill-climbing heuristic seemed to display the property that implementation decisions
were made early and were rarely re-made. Straub [46] notes that even though his
representation selection algorithm was run many times with widely varying expected
values for the program’s variables, “the choice of data structures made by the system
tended to be independent of the responses to the queries made to the user.” He
concluded that this indicates that the selection depends much more on the operations
performed than on the expected values of the program variables. (It is curious that
he does not even consider collecting profile data as a better source of information
than interactive querying of the user. Nor does he consider profile data as a source
for finding those important operations.)

Apparently, the representation selection process is being made much more
complicated than it really is: good selections depend on a small part of the
information that has been utilized in previous research. My hypothesis is that this is
due to the 90-10 nature of most programs. That is, if more weight is given to the
more frequently executed sections of code (as they are in Low’s technique) then the
implementation costs of these sections will dominate the overall execution costs of
the program. It also means that implementation decisions that are good for these
‘hot spots’ will be good for the whole program, and, conversely, implementation
decisions for very infrequently executed code have little effect on the performance of
the program.

This  observation,  bolstered  by  the  observations  of  previous  researchers,  plus 


Figure  4.13:  Steps  to  process  a  User  program 


the success of the Greedy Sewing algorithm for code reorganization (covered in
Chapter 3), plus Hansen’s success with focused optimization [17], plus the time-honored
success of hand-tailored optimizations based on profile data, all suggest that in
general a greedy algorithm using profile data will work for assigning implementations
to abstract data types. This results in a two-stage process: stage one decides which
of the existing implementations are feasible, and stage two chooses from among the
feasible implementations the one that minimizes the cost of running the program.
Furthermore, the algorithm does not require analysis of the program flow graph, as
do Low’s, Rowe’s, and Straub’s techniques: it reduces to a problem of matching
function implementations with function call sites.

Three  questions  must  be  answered  to  find  an  implementation  for  a  variable. 

1. Does enough information exist to take advantage of a representation? That
is, does the information exist that would allow variable v to be implemented
with implementation i? For example, if a list is implemented as a fixed size
array, we have to know the maximum size of the list. If the maximum size
of a list is unknown, or there is no maximum size, then the fixed-size array
implementation of a list is not feasible.

2. Does enough of a representation exist to implement a variable? That is, if
variable v is given implementation i, for each call site c which has v in its
actual parameter list, does there exist an implementation f of F_c consistent
with that and all previous assignments of implementations?

3.  Which  implementation  of  variable  v  will  provide  the  best  performance  for  the 
program? 

The first two questions come under the realm of feasibility: is it possible to
select an implementation for the program? The last question seeks to find an
implementation that minimizes the cost of executing the program, using the
Implementor-provided evaluation functions as objective functions for that minimization.

The assignment algorithm which I have implemented is in Figures 4.15
through 4.19. The pseudo-code is meant to be descriptive rather than a rigorous
program in a well-defined language. A description of some functions that are not
otherwise defined can be found in Figure 4.19. The code is not specific as to how
the assignment of variables and functions is recorded. We assume a global data
structure assignment contains a consistent assignment when chooseImplementation
returns, or it is empty if no assignments were possible.

The basic procedure is summarized as: sort all call sites in decreasing
order by some preliminary metric and assign implementations to variables based
on the cheapest implementation of the functions called at each site. While I have
emphasized the evaluation functions for alternative implementations, the profiling
implementation for each ADT also has evaluation functions that are used to estimate

proc chooseImplementation(Set(callSite) C)
    List(callSite) S;

    S = sortByImportance(C);

    if not assignable(S) then error fi

endproc


Figure 4.15: Main routine for choosing representations


the potential impact of a call site. Consider a program that creates a set of n objects,
sorts the set into a list, and then accesses the objects in the list. The call site
(assume there is only one) that adds elements to the set is executed n times. The
list is accessed kn times, for some integral k. But the sort routine is called exactly
once. Nevertheless, that sort routine has complexity O(n log n) to O(n^2), meaning it
has the potential of swamping the significance of the add-an-element function. The
evaluation functions of the profiling implementation return values that reflect this
potential impact of each function.
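
The ranking in this example can be worked through with a small C++ sketch. It assumes unit cost per add and per access and models the single sort call as n*log2(n); the names SiteEstimate, moreImportant, and rankSites are hypothetical and stand in for whatever the profiling implementation's evaluation functions return.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

struct SiteEstimate { std::string name; double impact; };

static bool moreImportant(const SiteEstimate& a, const SiteEstimate& b) {
    return a.impact > b.impact;
}

// Estimated impact of the three call sites in the example, for a set of
// n objects and k accesses per object, sorted in decreasing order.
std::vector<SiteEstimate> rankSites(double n, double k) {
    SiteEstimate addSite = {"add", n};                      // executed n times
    SiteEstimate accessSite = {"access", k * n};            // executed kn times
    SiteEstimate sortSite = {"sort", n * std::log2(n)};     // executed once
    std::vector<SiteEstimate> s;
    s.push_back(addSite);
    s.push_back(accessSite);
    s.push_back(sortSite);
    std::stable_sort(s.begin(), s.end(), moreImportant);
    return s;
}
```

For n = 1024 and k = 3 the estimates are 1024 for add, 3072 for access, and 10240 for sort, so the call site that executes only once is considered first.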

The recursive selection algorithm is encoded primarily in the function assignable
(Figure 4.16), which is passed a list of call sites; the first call site c on the list will
be assigned an implementation, if possible. That is, at call site c where the abstract
function F_c is invoked, assignable will pick the (next) cheapest implementation of F_c,
which thereby determines the types of the variables that are passed as parameters to
F_c. The function findCompatible (Figure 4.18) finds all implementations of F_c that
are compatible with this call site. This set of compatible implementations is then
sorted in increasing order of the estimated costs provided by the implementations’
evaluation functions.

For each function f in the sorted sequence of implementations S, if the
actual parameters to F can be assigned the implementation types required by the
implementation function f, then an assignment is attempted on the next call site
on list C. Otherwise, we back out of any implementation assignments made in this
invocation of assignable, and another implementation function f on S is tried as the
implementation function for this call site. If every implementation function has been
tried and no consistent assignment of implementations to variables and call sites has
been made, then assignable returns false. The function parmsImplementable
(Figure 4.17) determines if the implementation function f can be used to implement the
abstract function F by checking that the actual parameters to F can be assigned
the implementation types required by the formal parameters of f.

function assignable(List(callSite) C) returns boolean:
    if isEmpty(C) then return true;
    F ← absFcn(first(C));
    S ← sortByCost(findCompatible(first(C)));
    foreach f on S do
        if not parmsImplementable(f,F) or
           not assignable(rest(C)) then    -- backtrack
            undoImplementations;
        else return true;
        fi
    endfor
    return false;
endfcn;


Figure 4.16: Routine assignable for choosing the implementation of an abstract
function


function parmsImplementable(impFunc f, absFunc F) returns boolean:
    foreach v ∈ signature(F), each t ∈ signature(f) do    -- parallel
        if not alreadyImpltd(v) then
            if implementable(v,t) then
                implement(v,t);
            else    -- conflict; backtrack
                return false;
            fi
        fi
    endfor
    return true;
endfcn

function implementable(variable v, implType t) returns boolean:
    if not atype(v) ^ t then return false;
    if not t.feasible(v) then return false;
    return true;
endfcn


Figure 4.17: Mapping parameters onto implemented functions’ signature

function findCompatible(callSite c) returns Set(implType):
    I ← implementations(absFcn(c));
    Ic ← emptySet;
    foreach f ∈ I do
        if isCompatible(f,c) then
            Ic ← Ic ∪ {f};
    enddo
    return Ic;
endfcn

function isCompatible(absFunc f, callSite c) returns boolean:
    return if signature(f) actuals(c) then true else false;
endfcn;


Figure 4.18: Finding compatible function implementations

The function implementable checks that v can be assigned the
implementation type t. The function call

t.feasible(v)

calls the Implementor-supplied feasibility function for the implementation t to verify
that t is a feasible representation for v.
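
The backtracking search of Figures 4.16 and 4.17 can be sketched as runnable C++. This is an illustration under simplifying assumptions, not the THERBLIG source: variables are plain integers, implementation types are strings, each call site carries its own list of candidate implementation functions (so findCompatible is folded into the data), and the feasibility check is passed in as a callback. The sketch also rejects a candidate whose required type conflicts with a type already bound to a variable, which is how "consistent with all previous assignments" is interpreted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

typedef std::string ImplType;                       // e.g. "Set_bm"
struct ImplFunc { std::vector<ImplType> signature; double cost; };
struct CallSite { std::vector<int> actuals; std::vector<ImplFunc> candidates; };
typedef std::map<int, ImplType> Assignment;         // variable id -> type
typedef std::function<bool(int, const ImplType&)> Feasible;

static bool cheaper(const ImplFunc& a, const ImplFunc& b) { return a.cost < b.cost; }

// Bind each actual of call site c to the type f requires. New bindings are
// recorded in `added` so that the caller can undo them.
static bool parmsImplementable(const ImplFunc& f, const CallSite& c,
                               const Feasible& feasible,
                               Assignment& asg, std::vector<int>& added) {
    if (f.signature.size() != c.actuals.size()) return false;
    for (std::size_t k = 0; k < c.actuals.size(); ++k) {
        const int v = c.actuals[k];
        const ImplType& t = f.signature[k];
        Assignment::const_iterator it = asg.find(v);
        if (it != asg.end()) {
            if (it->second != t) return false;      // conflicts with earlier choice
        } else if (feasible(v, t)) {
            asg[v] = t;
            added.push_back(v);
        } else {
            return false;
        }
    }
    return true;
}

// `sites` must already be ordered by decreasing importance.
bool assignable(const std::vector<CallSite>& sites, std::size_t idx,
                const Feasible& feasible, Assignment& asg) {
    if (idx == sites.size()) return true;
    std::vector<ImplFunc> S = sites[idx].candidates;
    std::stable_sort(S.begin(), S.end(), cheaper);  // cheapest first
    for (std::size_t j = 0; j < S.size(); ++j) {
        std::vector<int> added;
        if (parmsImplementable(S[j], sites[idx], feasible, asg, added) &&
            assignable(sites, idx + 1, feasible, asg))
            return true;
        for (std::size_t k = 0; k < added.size(); ++k)
            asg.erase(added[k]);                    // backtrack
    }
    return false;
}
```

If the most important call site prefers a bit map for a variable but a later call site offers only a linked-list implementation for the same variable, the search undoes the first choice and settles on the linked list for both.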

Classes  of  variables  due  to  aliasing  in  user  functions 

User-declared functions require some twists on the algorithm as we have
presented it so far. User-declared functions can result in equivalence classes of
variables caused by aliasing of variables with formal parameters in the signature of these
functions. In our example on page 50, if representation B is assigned to F1, then it
also has to be assigned to S2. These equivalence classes are reminiscent of Low’s
equivalence classes; however, his classes were imposed by the design decision to not
allow mixed representation functions in the ADT interfaces and were much more
restrictive; the equivalence classes of aliased variable names are much less so.

Figure 4.20 contains the modifications to the algorithm to handle this
complication. The prime difference between Figure 4.17 and Figure 4.20 is that the
former looks only at the variable, while the latter looks at all variables that are
equivalent to the variable because of parameter passing to calls on user functions.
This acts as an implementation optimization to make sure implementation selections
are compatible without having to backtrack.
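
One way such equivalence classes could be maintained is with a union-find structure keyed by variable name. This is only a sketch under that assumption: the class AliasClasses and its methods are hypothetical and are not taken from the TYPESETTER sources.

```cpp
#include <map>
#include <string>

// Hypothetical helper: groups variables that are aliased through parameter
// passing to user functions into equivalence classes.
class AliasClasses {
public:
    // Record that variable `actual` is passed for formal parameter `formal`.
    void alias(const std::string& actual, const std::string& formal) {
        std::string ra = find(actual);
        std::string rf = find(formal);
        parent_[ra] = rf;
    }

    // True if both names belong to the same equivalence class.
    bool sameClass(const std::string& a, const std::string& b) {
        return find(a) == find(b);
    }

private:
    // Return the representative of v's class, creating a singleton if needed.
    std::string find(const std::string& v) {
        auto it = parent_.find(v);
        if (it == parent_.end()) {
            parent_[v] = v;
            return v;
        }
        if (it->second == v) return v;
        std::string root = find(it->second);
        parent_[v] = root;  // path compression
        return root;
    }

    std::map<std::string, std::string> parent_;
};
```

With classes available, the check in Figure 4.20 amounts to testing every member of the class of v before any of them is given a type.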
absFcn(callSite c)
    -- abstract function invoked at call site c;
signature(absFunc f)
    -- signature of abstract function;
actuals(callSite c)
    -- list of actual parameters at call site c;
sortByImportance(Set(callSite) C)
    -- list of call sites sorted in decreasing
    -- order of the results of the evaluation functions
    -- in the profiling implementation;
sortByCost(Set(impFunc) T)
    -- list of implementation functions sorted in
    -- increasing order of estimated execution cost;
first(List(any) T)
    -- first element of list;
implement(variable v, implType t)
    -- update implementation list contained
    -- in global variable assignment
assignImplType(variable v, implType t, signature s)
    -- assign all occurrences of variable v
    -- the implType t in the signature s
assignImplType(variable v, implType t, signature s)
    -- return the signature s with all
    -- instances of v assigned implementation type t


Figure  4.19:  Miscellaneous  functions 


proc parmsImplementable(impFunc f, absFunc F) returns boolean:
    foreach v ∈ signature(F), each t ∈ signature(f) do
        if not alreadyImpltd(v) then
            if implementable(w, t) ∀w ∈ class(v) then
                implement(w, t) ∀w ∈ class(v);
            else    -- conflict; backtrack
                return false;
            fi
        fi
    endfor
    return true;
endproc


Figure 4.20: Mapping parameters onto implemented functions’ signature with
equivalence classes
class Set {
    void in(Any e);
    void add(Any e);
    void subset(Set s);
};

Figure  4.21:  Declarations  of  generic  Set  functions 

4.4.2  Code  sharing 

For each user variable declared to be, say, a set of some user-type UType, a
naive implementation of the generic abstraction of sets would create a new copy of
the implementation code for sets with all instances of the generic parameter replaced
with UType. In general, this is quite unnecessary, and particularly so in the context
of TYPESETTER where all ADTs are restricted to be exomorphic and UType is
restricted to be a pointer to a user type. In this case, code can be written once to
handle sets of pointers to objects. In the case of sets of various sized integers, only
one copy need be created for each size of integer. If pointers are the same size as
one of the sizes of integers, the same version of Set code can be used for both.

However,  it  is  important  not  to  give  up  strong  type  checking  to  gain  this 
savings  in  code  space.  Users  should  still  be  notified  when  their  programs  violate  the 
declarations  they  themselves  have  made. 

For instance, Figure 4.21 gives the generic declarations of some of the
interface functions for sets. The abstraction of sets is generic in one type, that of the
elements of the sets. If the user declares types Token and String, for
instance, and then declares variables to be Set(Token) and others to be Set(String),
it would be wasteful to have two implementations of the set functions, one for
Tokens and one for Strings, since both are actually implemented as a set of pointers
to tokens, and set of pointers to strings. All that is needed is an implementation of
sets that can handle pointers.

All that is needed to maintain strong type-checking is the declaration of
coercion types for a set of Tokens and a set of Strings, each of which calls the
appropriate function for handling sets of pointers. Therefore, TYPESETTER generates
one implementation of the set functions capable of handling pointers (and
incidentally four-byte integers on many systems) and then generates a coercion class for
maintaining strong type checking. The declarations for an implementation of ‘set
of pointers’ are in Figure 4.22; Figure 4.23 contains the coercion implementations.
This model of typing and implementation of generic functions builds on the template
idea first proposed by Stroustrup [10] for C++ but is much more powerful in that
it allows the Implementor more control over how much new source code is generated.
Such control cannot be created easily in C++ without extending the language
further than is suggested in the latest C++ language reference by Ellis and Stroustrup
[10].

class Set_ptr {
    void in(void* e);
    void add(void* e);
    boolean subset(Set_ptr& s);
};


Figure 4.22: Declarations of generic Set functions


class Set_Token : public Set_ptr {
    void in(Token* e) { Set_ptr::in((void*) e); }
    void add(Token* e) { Set_ptr::add((void*) e); }
    boolean subset(Set_Token& s) { return Set_ptr::subset((Set_ptr&) s); }
};

class Set_String : public Set_ptr {
    void in(String* e) { Set_ptr::in((void*) e); }
    void add(String* e) { Set_ptr::add((void*) e); }
    boolean subset(Set_String& s) { return Set_ptr::subset((Set_ptr&) s); }
};


Figure 4.23: Declarations of generic Set functions
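
For readers more familiar with present-day C++, the same idea of one shared pointer-set implementation behind thin, strongly typed wrappers can be sketched as follows. This is an illustration only: SetPtr, SetOf, Token, and Str are hypothetical names, the vector-based storage is an arbitrary choice, and subset is assumed to mean "every element of this set is in s". It uses a template for the wrapper, which is not how TYPESETTER generates its coercion classes.

```cpp
#include <algorithm>
#include <vector>

// Single shared implementation: a set of untyped pointers.
class SetPtr {
public:
    bool in(const void* e) const {
        return std::find(elems_.begin(), elems_.end(), e) != elems_.end();
    }
    void add(const void* e) {
        if (!in(e)) elems_.push_back(e);
    }
    // Assumed meaning: true if every element of *this is also in s.
    bool subset(const SetPtr& s) const {
        return std::all_of(elems_.begin(), elems_.end(),
                           [&s](const void* e) { return s.in(e); });
    }

private:
    std::vector<const void*> elems_;
};

// Thin typed wrapper: adds no code beyond forwarding calls, but the compiler
// rejects attempts to mix element types.
template <typename T>
class SetOf : private SetPtr {
public:
    bool in(const T* e) const { return SetPtr::in(e); }
    void add(const T* e) { SetPtr::add(e); }
    bool subset(const SetOf<T>& s) const { return SetPtr::subset(s); }
};

// Hypothetical user types.
struct Token {};
struct Str {};
```

With these definitions, adding a `Str*` to a `SetOf<Token>` fails to compile, while both wrappers share the single SetPtr implementation.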

4.4.3  Refinements 

In order to simplify the above discussion of THERBLIG, I have not included
the enhancements added to the code that allow THERBLIG to search the solution
space in a controlled manner. In the actual implementation (see Appendix B), using a
mechanism very similar to that used in the Greedy Sewing Algorithm, THERBLIG can
be invoked with a parameter p that specifies indirectly the portion of call sites that
are ‘optimally’ assigned implementations; i.e., the search for their implementations
is exhaustive, with every possible combination of implementations examined.

The set of call sites is sorted in decreasing order of the values returned by
the profiling implementation’s evaluation functions. Let $S = \sum_i C(i)$, where $C(i)$ is
the cost estimate returned for the function at the $i$-th location in the list. The sum
of these values, $S$, is multiplied by the parameter $p$, a number between 0 and 1, to
determine a cutoff point in the list of sorted call sites. A call site at location $i$
in the list of sorted call sites is above the cutoff point if $\sum_{j<i} C(j) < p \cdot S$, and
it is below the cutoff point if $\sum_{j<i} C(j) \ge p \cdot S$. At each point in the assignment
algorithm, if the call site being considered is below the cutoff point, only the first
consistent implementation is returned, and all others are ignored. If the call site is
above the cutoff point, then each consistent implementation is examined to see if it
improves the program’s implementation.

When THERBLIG is invoked with p = .9, all possible implementations are examined
for the call sites that account for 90% of the estimated runtime resources. By setting
p = 1, all possible implementations are examined. The results and methods of
the algorithms as I have described them above are achieved by setting p = 0; i.e.,
THERBLIG returns the first consistent implementation for the program.
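
The cutoff rule can be illustrated with a small C++ function. It is a sketch under the assumption that the cost estimates are already sorted in decreasing order; the name exhaustiveCount is hypothetical and does not come from the THERBLIG sources.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// costs[i] is the estimate C(i) for the call site at position i of the list
// sorted in decreasing order. Returns how many leading call sites lie above
// the cutoff, i.e. satisfy sum_{j<i} C(j) < p * S.
std::size_t exhaustiveCount(const std::vector<double>& costs, double p) {
    const double S = std::accumulate(costs.begin(), costs.end(), 0.0);
    double prefix = 0.0;  // sum of C(j) for j < i
    std::size_t i = 0;
    while (i < costs.size() && prefix < p * S) {
        prefix += costs[i];
        ++i;
    }
    return i;
}
```

Because the prefix sums never decrease, the call sites above the cutoff always form a leading segment of the sorted list, so a single count describes them. With p = 0 the count is zero, and with p = 1 it is the whole list.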


4.5  Examples 

I have demonstrated that TYPESETTER code is not difficult to write,
either for the User or for the Implementor. No specialized knowledge of compilers
or profiling technology is required by either the User or the Implementor. The
Implementor specifies the information required to make a reasonable implementation
decision with normal-looking programming language statements; the only difference
is that profiling variables are allocated per call site rather than per function. In an
ideal implementation of TYPESETTER, the User would need only (1) to re-compile
the system as directed by the system (although this cycling could certainly be
automated), and (2) to be aware of the different kinds of optional information that may
be specified for a data type (e.g., upper and lower bounds on elements of sets).

Given profile data and User declarations, TYPESETTER gives the User
program a ‘reasonable’ implementation. I cannot claim that TYPESETTER constructs
‘optimal’ implementations: the whole process of software construction is too
heuristic to allow such a claim. Future work can concentrate on determining exactly how
close to ‘optimal’ an implementation of a User’s program can be. For this exploratory
work, I have concentrated on demonstrating that the implementations chosen are
not ‘wrong’, that is, that TYPESETTER chooses an implementation for an abstract
data type that a human programmer would agree is a reasonable candidate.

To convince the reader, I will present some results using three examples
to demonstrate TYPESETTER’s flexibility. The first example is a small program
(approximately 60 lines) that is useless except to the extent it displays some of
the capabilities of TYPESETTER. The second example is an implementation of the
MINOPT algorithm presented in section 2.2.2. Finally, we will look at TYPESETTER
itself, and examine how it chooses its own implementation. The TYPESETTER
prototype has nine implementations spread among the three abstractions Set, List, and
Map. Set has five implementations, and the other two have two apiece. Since Sets
have more possibilities than the other two ADTs, we will concentrate on showing
how TYPESETTER performs on variables declared to be sets of User-defined objects.

There are two distinct questions that the prototype was designed to answer.
The first is to test our hypothesis that a greedy assignment algorithm works well. To
recap, the implementation assignment algorithm used in TYPESETTER sorts the call
sites in decreasing order of importance (where importance is estimated by evaluation
functions provided by the Implementor), assigns the most efficient implementation to
the first call site, and then, in decreasing order of importance, assigns to all other call
sites the most efficient implementation that is consistent with previous assignments.
We want to know how quickly an initial assignment of implementations is made, and
how close that assignment is to the ‘optimal’ solution, assuming that the performance
estimates returned by the evaluation functions are accurate.

The  second  question  is:  how  accurate  are  the  estimates  returned  by  the 
Implementor’s  evaluation  functions?  Or,  in  other  words,  how  closely  does  the  final 
performance  of  the  implemented  program  correlate  with  the  predictions  made  by  the 
Implementor’s  evaluation  functions? 


4.5.1  Small  example 

Figure 4.24 contains the TYPESETTER code for an example program that
constructs three sets of integers. Given that the declarations of the variables contain
some optional declarations that tell TYPESETTER the sets really are sets of integers,
it is not too surprising that TYPESETTER selects a bit-mapped implementation for
them over a linked-list implementation. The primary point of this example, however,
is TYPESETTER’s ability to share code between instantiations of abstract types.
Even though the set cSet has more elements than do sets aSet and bSet, they all
three have few enough base elements that they can fit in a 32-bit word, hence they
will all use the same implementation of one-word bitmaps. However, because aSet
and bSet were declared to be sets of short integers, they will share a coercion class
that is different from cSet’s.
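
A one-word bitmap set of the kind described here can be sketched in a few lines of C++. The class name BitmapWordSet and its members are hypothetical; the actual Set_bmwrd implementation is not reproduced in this section and may differ.

```cpp
#include <cstdint>

// Hypothetical sketch of a set over the base range 0..31 held in one word.
class BitmapWordSet {
public:
    // Precondition: e < 32.
    void add(unsigned e) { bits_ |= (std::uint32_t{1} << e); }

    bool in(unsigned e) const { return ((bits_ >> e) & 1u) != 0; }

    // Keep only the elements that are also in other.
    void intersect(const BitmapWordSet& other) { bits_ &= other.bits_; }

    // Number of elements, counted by clearing the lowest set bit repeatedly.
    unsigned size() const {
        unsigned n = 0;
        for (std::uint32_t w = bits_; w != 0; w &= w - 1) ++n;
        return n;
    }

private:
    std::uint32_t bits_ = 0;
};
```

Both add and intersect take a constant number of machine operations regardless of how many elements the set holds, which is why a representation like this is attractive whenever the declared bounds fit in a word.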

Figure  4.25  shows  the  order  of  priority  given  to  the  call  sites  of  our  example 
program  based  on  estimates  provided  by  the  profiling  implementation’s  evaluation 
routines.  Each  line  shows  the  name  of  the  function  being  invoked,  the  line  in  the  file 
where  the  invocation  occurs,  the  index  in  the  profiling  array  for  this  file  (there  is  one 
for  each  source  file  making  up  a  program),  and  the  values  of  the  profiling  variables. 
Finally, the “profiling cost” (actually an estimate of the cost) is given. The call
sites are sorted in decreasing order based on those estimates.

We'll look closely at the information printed for the Set__add function on
line 35 of our test program (the line numbers are not contiguous due to some
irrelevant material not included). It has seven profiling variables corresponding to
the numbers given in parentheses: p_cnt, p_szA, p_appended, p_prepended, p_wasIn,
p_inserted, and p_lookedAt. Respectively, they count the number of times this call site
was executed (p_cnt=8), the sum of the size of the set at each invocation (p_szA=28),
the number of times the element could be appended to a list in which the
elements of the set were sorted by their memory address (p_appended=8), or prepended
DECLARE(aSet, Set, short, ObjsAreInts, upperb=15, lowerb=0)
DECLARE(bSet, Set, short, ObjsAreInts, upperb=15, lowerb=0)
DECLARE(cSet, Set, int, ObjsAreInts, upperb=31, lowerb=0);

main()
{
    int i;

    for (i = 0; i < 16; i++) {
        if ((i & 2) != 0) {
            Set_add(aSet, i);
        }
    }
    Set_add(aSet, 1);
    Set_add(aSet, 10);
    for (i = 0; i < 16; i++) {
        if ((i & 4) != 0) {
            Set_add(bSet, i);
        }
    }
    //
    for (i = 0; i < 32; i++) {
        if ((i & 15) == 15) {
            Set_add(cSet, i);
        }
    }
    Set_intersect1(bSet, aSet);
    //
    // for every integer in the set c
    //
    for (int j = 0; j < LOOPSIZE; j++) {
        forAll(i, cSet,
            if (j == 0) {
                cout << "Found " << i << "\n";
            }
        );
    }
}

Figure  4.24:  Small  example 


Sorted call sites:
Set__intersect1  (line 60 file impltest p[40]=(1,8,9,13,4,4) profiling costs 72)
Set__add         (line 35 file impltest p[5]=(8,28,8,0,0,0,28) profiling costs 36)
Set__add         (line 45 file impltest p[26]=(8,28,8,0,0,0,28) profiling costs 36)
Set__add         (line 38 file impltest p[12]=(1,8,0,1,0,0,0) profiling costs 9)
Set__add         (line 39 file impltest p[19]=(1,9,0,0,1,0,5) profiling costs 9)
Set__add         (line 54 file impltest p[33]=(2,1,2,0,0,0,1) profiling costs 3)
Set__iterate     (line 72 file impltest p[47]=(3,6) profiling costs 3)
Set              (line 25 file impltest p[0]=(1) profiling costs 1)
Set              (line 26 file impltest p[1]=(1) profiling costs 1)
Set              (line 27 file impltest p[2]=(1) profiling costs 1)
Set__iterinit    (line 72 file impltest p[46]=(1) profiling costs 1)
Set__iterCleanup (line 72 file impltest p[49]=(1) profiling costs 1)


Figure  4.25:  The  call  sites  sorted  by  profiling  estimates  of  importance 

(p_prepended=0), the number of times the element being added was already in the set
(p_wasIn=0), the number of times the element being added had to be inserted into
the interior of a list sorted by address (p_inserted=0), and the sum of the number of
elements that had to be examined over all calls to this function (p_lookedAt=28).
The actual code for the profiling implementation for sets is given in Figure 4.26,
from which we can see that the profiling implementation also keeps the elements of
the set on a list sorted by their memory address.

From Figure 4.25, we can see that the profiling evaluation routines consider
the intersection operation on line 60 to be the dominating factor in this program,
giving it a weight (72) twice the nearest competitor (the two adds, weight 36). Since
the intersection function has aSet and bSet as parameters, assigning an
implementation to the intersection function on line 60 will also assign implementations to those
two variables. The evaluation functions for the four possible implementations of sets
produced the estimates in Figure 4.27 for the intersect function. The bitmapped-word
implementation is the cheapest, while the most expensive implementation is
the one that keeps the elements on a sorted list (in this case, the list is sorted by
the values of the integers): apparently, the number of elements in the two sets is
not sufficient to pay for the extra overhead of keeping the lists sorted. Therefore,
the intersection function on line 60 was assigned Set_bmwrd_intersect1, the single
word bit-map implementation for sets whose base size is less than or equal to 32.
Once this implementation for the function is decided upon, then the arguments
to intersect1 (aSet and bSet) will be assigned types corresponding to the formal
void
FUNCTION(add)(Any e)
{
    Set_P_Link *lp = firstp;     // current link while scanning the sorted list
    Set_P_Link **bp = &firstp;   // address of the pointer that leads to lp

    p_cnt++;
    p_szA += len;
    if (lp != nil && e < lp->data) {
        p_prepended++;
    }
    else {
        // walk past every element smaller than e
        while (lp != nil && e > lp->data) {
            bp = &lp->next;
            lp = lp->next;
            p_lookedAt++;
        }
        if (lp == nil) p_appended++;
        else if (e == lp->data) p_wasIn++;
        else p_inserted++;
    }
    if (lp == nil || e < lp->data) {
        // e is not yet in the set: link it in ahead of lp
        Set_P_Link *tp = new Set_P_Link;
        tp->data = e;
        tp->next = lp;
        *bp = tp;
        len++;
    }
    assert(p_cnt == p_prepended + p_appended + p_wasIn + p_inserted);
}


Figure  4.26:  The  actual  profiling  implementation  for  the  add  function  for  Sets 


Callsite(40)  Set_bmwrd::intersect1     = 1.5
Callsite(40)  Set_slist::intersect1     = 68.1125
Callsite(40)  Set_bmarr::intersect1     = 5.9
Callsite(40)  Set_slistord::intersect1  = 91.4


Figure 4.27: Estimates of the cost of the intersection function

Callsite(40)  Set_bmwrd::add     = 12
Callsite(40)  Set_slist::add     = 23.4
Callsite(40)  Set_bmarr::add     = 16.8
Callsite(40)  Set_slistord::add  = 27.6


Figure 4.28: Estimates of the cost of the add function on line 54

types of Set_bmwrd_intersect1, which in this case is also Set_bmwrd. Our prototype
has only one implementation of the intersect function with two parameters defined.
TYPESETTER is designed to allow as many implementations as Implementors may
deem usable in various situations. This is easily incorporated into TYPESETTER
because we concentrate on assigning implementations to functions: in other words,
the implementation of variables occurs as a side effect of assigning implementations to
functions.

After assigning an implementation to the intersection function, the only
variable implementation remaining to be decided is that of cSet. Since it does not
interact with either aSet or bSet in a function call, its implementation is independent
of theirs. The call on Set__add on line 54 of the program is the most important
remaining call site, according to the profiling implementation’s estimates. Figure 4.28 gives the
implementations’ estimates of the cost of calling their respective versions of the
Set__add function; the word bitmap version is the cheapest. Therefore, cSet is also
assigned the word bitmap implementation.
Figure 4.29 shows TYPESETTER’s output, specifying the types of the variables of
our small program based on these considerations. The specifications are interpreted
as follows:

INSTANTIATE(I, N, P...)  Create source code for implementation I (name the class
N) with parameters P. In our example in Figure 4.29, the instantiations require
two parameters: the functions for taking an object to an integer and the inverse.

COERCE(I, N, C, P...)  Create a coercion class C which converts calls on the
functional interface of implementation I into the instantiation class N, using the
parameters P.

DECLARE_M(V, C)  Declare variable V to be of type (coercion class) C.

INSTANTIATE(Set_bmwrd, Set_bmwrd_int_int, int, int)@;
COERCE(Set_bmwrd, Set_bmwrd_int_int, Set_bmwrd_int_int_of_int, int)@;
DECLARE_M(cSet, Set_bmwrd_int_int_of_int, 32)@;
@;
INSTANTIATE(Set_bmwrd, Set_bmwrd_int_int, int, int)@;
COERCE(Set_bmwrd, Set_bmwrd_int_int, Set_bmwrd_int_int_of_short, short)@;
DECLARE_M(bSet, Set_bmwrd_int_int_of_short, 16)@;
@;
INSTANTIATE(Set_bmwrd, Set_bmwrd_int_int, int, int)@;
COERCE(Set_bmwrd, Set_bmwrd_int_int, Set_bmwrd_int_int_of_short, short)@;
DECLARE_M(aSet, Set_bmwrd_int_int_of_short, 16)@;


Figure 4.29: TYPESETTER’s assignment of types to the program


The sample program contains a constant, LOOPSIZE, that determines the number of times
the loop containing the iteration over the elements of cSet is executed. If that
constant is set to ten, instead of one, then the profile data induces TYPESETTER to make
a different implementation assignment to cSet. Figure 4.30 contains a summary of
the output from TYPESETTER, emphasizing the differences with the previous run
of the program. The intersection function is still the most important, but the
iterator functions have moved up in importance. Again, because of the independence
of cSet from the other variables in the program, there is no effect except on cSet’s
implementation. Now its most important call site is the call on Set__iterate, and
the various implementations’ estimates of cost are shown in Figure 4.30. Set_slist
and Set_slistord evaluate the same since they are both linked-list implementations,
differing only in the order in which the elements of the set are returned. Selecting
between them more or less at random results in assigning the Set_slist implementation
to cSet.
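
The selection step itself is just a minimum over the per-implementation estimates. The following C++ sketch illustrates it with the figures quoted in Figure 4.28 and Figure 4.30; the names Estimate and cheapest are hypothetical, and the tie between equal estimates is broken here by list order, which is only one possible reading of "more or less at random".

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// (implementation name, estimated cost)
using Estimate = std::pair<std::string, double>;

// Returns the name with the lowest estimate; est must be non-empty.
// std::min_element returns the first of several equal minima.
std::string cheapest(const std::vector<Estimate>& est) {
    auto it = std::min_element(
        est.begin(), est.end(),
        [](const Estimate& a, const Estimate& b) { return a.second < b.second; });
    return it->first;
}
```

Applied to the add estimates (12, 23.4, 16.8, 27.6) this picks Set_bmwrd, and applied to the iterate estimates (377.830, 367.916, 105, 105) in the order listed it picks Set_slist.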


The above implementations were assigned by THERBLIG with p = 0; that is,
the implementations chosen were the first consistent set of implementations. While
a reasonable argument can be made for the implementations that were selected, two
questions still remain. Are they the best implementations possible, given the results
of the evaluation functions? And does the performance of the program improve?

To be able to answer the second question, we have to have a program
that requires a non-trivial amount of time. To that end, we modify our LOOPSIZE
macro to 100,000. To answer the first question, we run THERBLIG on the program
with p = 1; i.e., all possible assignments of implementations are evaluated. On
this small example, it made no difference: an exhaustive search across all possible
implementations of the program still assigned Set_slist to cSet, and Set_bmwrd to
aSet and bSet.

Table 4.1 shows the various running times of our small example program
when cSet is implemented with each of the possible implementations for Set. From it,
we can see that THERBLIG correctly chose Set_slist as one of the best implementations
possible for cSet. Running THERBLIG with p = .9 produced exactly the same result
as p = 1, corroborating my hypothesis.
Sorted call sites:
Set__intersect1  (line 60 file impltest p[40]=(1,8,9,13,4,4) profiling costs 72)
Set__add         (line 35 file impltest p[5]=(8,28,8,0,0,0,28) profiling costs 36)
Set__add         (line 45 file impltest p[26]=(8,28,8,0,0,0,28) profiling costs 36)
Set__iterate     (line 72 file impltest p[47]=(30,60) profiling costs 30)
Set__iterinit    (line 72 file impltest p[46]=(10) profiling costs 10)
Set__iterCleanup (line 72 file impltest p[49]=(10) profiling costs 10)
Set__add         (line 38 file impltest p[12]=(1,8,0,1,0,0,0) profiling costs 9)
Set__add         (line 39 file impltest p[19]=(1,9,0,0,1,0,5) profiling costs 9)
Set__add         (line 54 file impltest p[33]=(2,1,2,0,0,0,1) profiling costs 3)
Set              (line 25 file impltest p[0]=(1) profiling costs 1)
Set              (line 26 file impltest p[1]=(1) profiling costs 1)
Set              (line 27 file impltest p[2]=(1) profiling costs 1)

Callsite(58): Set_bmwrd::iterate     = 377.830
Callsite(58): Set_bmarr::iterate     = 367.916
Callsite(58): Set_slist::iterate     = 105
Callsite(58): Set_slistord::iterate  = 105

INSTANTIATE(Set_slist, Set_slist)@;
COERCE(Set_slist, Set_slist, Set_slist_of_int, int)@;
DECLARE_M(cSet, Set_slist_of_int)@;
INSTANTIATE(Set_bmwrd, Set_bmwrd_int_int, int, int)@;
COERCE(Set_bmwrd, Set_bmwrd_int_int, Set_bmwrd_int_int_of_short, short)@;
DECLARE_M(bSet, Set_bmwrd_int_int_of_short, 16)@;
@;
INSTANTIATE(Set_bmwrd, Set_bmwrd_int_int, int, int)@;
COERCE(Set_bmwrd, Set_bmwrd_int_int, Set_bmwrd_int_int_of_short, short)@;
DECLARE_M(aSet, Set_bmwrd_int_int_of_short, 16)@;


Figure  4.30:  Results  from  the  example  with  LOOPCOUNT=  10 

Set_slist  2.30s 
Set_slistord  2.30s 
Set_bmwrd  6.32s 
Set_bmarr  7.63s 


Table  4.1:  Small  example  running  times  with  various  implementation  assignments 
for  cSet 
4.5.2 MINOPT


Appendix A contains TYPESETTER code for an implementation of the
MINOPT algorithm discussed in section 2.2.2. This program, which we’ll call minopt,
has three set variables, two list variables, and one map. The map is a dictionary
mapping tokens onto node and arc names. In order to compute execution
frequencies of a graph object (arc or node) that is not instrumented, each non-instrumented
object maintains a list of other graph objects from which its execution count is
computed; this is class member variable flowList in class GraphObj_obj. The other list is
a sorted list of all graph objects (sortedObjList).

Graph         gozintas      gozoutas      time
Set_bmarr     Set_slist     Set_slist     2.50s
Set_slist     Set_slist     Set_slist     2.92s
Set_slistord  Set_slistord  Set_slistord  3.37s
Set_bmarr     Set_bmarr     Set_bmarr     3.79s

Table 4.2: Running times for the K-S algorithm.

The  remaining  three  set  variables  are:  Graph,  the  set  of  all  objects  (nodes 
and  arcs)  that  comprise  the  current  graph;  Node_obj::gozintas,  the  set  of  all  arcs  that 
enter  a  node;  and  Node_obj::gozoutas,  the  set  of  all  arcs  that  exit  a  node. 

Profile data was generated by running two PFGs through minopt: one is
a small five-node graph that Knuth and Stevenson used as an example in their
paper [30], and the other is the graph in Figure 2.3. Based on that profile data,
THERBLIG, with p = 0, selected Set_bmarr for the variable Graph, and Set_slist for
the two arc sets, gozintas and gozoutas. This seems a reasonable assignment of
implementations, since Graph is added to and iterated over, but nothing else. Since
it is a completely full set, there are no penalties to pay in a bitmap implementation
for having to check bits in that map that aren’t set. This is not the case for the
gozintas and gozoutas variables: the number of arcs coming into or leaving a node
is never more than three in our example graphs; a linked list would do much better
for these two variables.

With p = 1, THERBLIG makes exactly the same choices, again in support
of the hypothesis that implementation decisions made early are close to the
‘optimal’. Table 4.2 gives minopt’s running times when the variables are assigned as
shown. The input data is a 364-node graph made by replicating and concatenating
the graph in Figure 2.3. The first entry in the table uses the implementations
chosen by THERBLIG, and the remainder show that it was indeed a reasonable set of
implementations.

4.5.3  Implementing  THERBLIG 

THERBLIG is the analysis software for the TYPESETTER system. From the
descriptions of the available abstractions and their implementations, and the
description of the User’s program, it selects implementations for the variables declared, and
functions invoked, in the User’s program. THERBLIG is the most complex software
written in TYPESETTER and, therefore, will be our next example.

THERBLIG consists of over 8500 lines of TYPESETTER code and comments.
This includes almost 2500 lines of TYPESETTER code for the analysis portion of the
software, with the other 6000 lines taken up by the nine implementations of the three
abstractions of Sets, Lists, and Maps. Sets has five implementations, including the
profiling implementation, while Lists and Maps have two apiece.


In the body of Therblig, there are 23 variables utilizing these abstractions:
four are Lists, seven are Maps, and twelve are Sets. Given that Sets have a more
complete set of implementations than do the other abstractions, we will look at how
they are implemented by TYPESETTER. The complete listing of Therblig is given
in Appendix B for reference.

The twelve Set variables in Therblig are:

ADTcalls:  The  set  of  all  call  sites  in  the  User’s  program. 

Ic:  A  formal  parameter  to  the  function  findCompatibleImplementations  (see  page  81). 

adt_afcns: A member of the class ADType_obj representing abstract data types; it
is the set of all functions that define the interface to the abstraction.

adtafJmpUcns:  A  member  of  the  class  ADTabsFcn_obj  representing  the  abstract 
functions,  each  of  which  will  have  a  set  of  functions  that  are  its  implementa¬ 
tions;  this  is  that  set  of  implementation  functions. 

callSites:  A  variable,  local  to  the  function  implementable,  containing  all  call  sites 
that  have  a  specific  variable  in  their  argument  lists. 

callSitesp: A formal parameter which contains all call sites with a specific variable
in their argument lists.

changed:  A  local  variable  (in  function  assignable)  which  keeps  track  of  variables 
which  have  been  given  tentative  assignments. 

changedp: A formal parameter which keeps track of variables which have been given
tentative assignments.

implSet:  A  local  variable  to  function  assignable  that  keeps  track  of  all  function 
implementations  that  are  compatible  with  the  current  state  of  assignments 
and  a  particular  call  site. 

ivars:  A  formal  parameter  to  undoimplementations  that  is  a  set  of  variables  whose 
tentative  assignments  are  to  be  undone. 

vdJnSigsOf: A member of the class VarDecl_obj that contains the set of all call
sites which have this variable as an actual argument.

as_set: Each variable used as a parameter to a User-defined function will be aliased
to other variables; this is the set of aliases for a variable. It is a member of the
class AliasSet_obj.

I have not attempted to solve the problem of detecting input-dependent
behavior: that is still up to Users to realize about their own programs, and to take
appropriate actions, specifically to run the programs a sufficient number of times
with typical input data to cover all the important behaviors of their programs. The
problem of determining when sufficient ‘typical’ input data has been utilized is also
beyond the scope of this dissertation. Therblig is interesting because it exhibits
input-dependent behavior: if there is no profile data for the User’s program, it
selects only profiling implementations for the program. Because this is a
straightforward procedure that does not require much computation, multiple profiling runs
of Therblig are required to ensure that the profiling data is representative of the
average behavior of the program. Once profile data exists for THERBLIG to analyze,
then the more extensive analysis described in section 4.4.1 is executed to determine
an implementation for the program under consideration. Looking at how THERBLIG
modifies its idea of a good selection will give us some insight into how it works.

The first step is to build a version of THERBLIG with all variables
implemented by the profiling versions of their underlying abstraction. The second step is
to run Therblig, feeding it its own source code. Since there is not yet any profile
data, this run simply reads the program description, and writes a specification file
giving all of its variables a profiling implementation: this is more-or-less a do-nothing
run of Therblig.

The third step is to run THERBLIG again, but this time there is profile data
generated from its first run. Figure 4.31 shows the call sites of all Set interface
functions sorted in the order indicated by the Set profiling implementation’s evaluation
functions using the profile data generated on the first run. There were a total of 208
call sites in the THERBLIG sources, 174 of which are in the main code. Of these 174,
only 53 involve calls on Set interface functions. (Only the non-zero cost call sites
that invoke functions in the Set abstraction are shown in the figures.)

The Set profiling implementation’s evaluation functions estimate that the
call site on line 761, which constructs Set objects, is possibly the greatest bottleneck
in the program, with an invocation of add trailing a very close second. In general,
the profiling implementations’ evaluation functions attempt to estimate the potential
impact of the function at a particular call site: it is a worst-case evaluation. In
the case of the first add on the list, if the implementation chosen were a linked-list
implementation, then adding an element could mean having to examine all of
the current members looking for duplication; hence the large estimate. Given the
number of times the add function on line 1078 was called (p_cnt = 208), and the
sum of the set sizes across those calls (p_szA = 21528), we can estimate the average
size of the set for each call to be 103.5 (and we note that 208/2 = 104). Of these
208 calls, exactly p_appended = 208 elements were appended to the address-sorted
list, and p_prepended = 0 were prepended. There were no elements already in the
set (p_wasIn = 0), and no elements had to be inserted in the list (p_inserted = 0).
Because all of the elements were appended to the list, p_lookedAt = 21528.
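
The arithmetic in this paragraph can be reproduced from the seven counters that Figure 4.31 prints for each Set__add call site. The following is a minimal sketch in C++, not Therblig's actual code: the struct name AddProfile, the helper functions, and the order of the three middle counters are assumptions made for illustration. The cost formula (number of calls plus elements examined) is inferred from the "profiling costs" values in Figure 4.31, which equal p_cnt + p_lookedAt for each Set__add entry shown; the real evaluation functions may be defined differently.

```cpp
#include <cstdint>

// Hypothetical mirror of the counters printed for one Set__add call site.
// The order of prepended/wasIn/inserted is assumed from the prose above.
struct AddProfile {
    std::uint64_t cnt;        // p_cnt: number of calls at this site
    std::uint64_t szA;        // p_szA: sum of the set sizes over all calls
    std::uint64_t appended;   // p_appended: elements added at the tail
    std::uint64_t prepended;  // p_prepended: elements added at the head
    std::uint64_t wasIn;      // p_wasIn: elements already present
    std::uint64_t inserted;   // p_inserted: elements added in the middle
    std::uint64_t lookedAt;   // p_lookedAt: members examined in total
};

// Average size of the set at the time of a call (0 if never called).
double averageSetSize(const AddProfile& p) {
    return p.cnt == 0 ? 0.0
                      : static_cast<double>(p.szA) / static_cast<double>(p.cnt);
}

// Assumed worst-case cost: one unit per call plus one per member examined.
std::uint64_t estimatedCost(const AddProfile& p) {
    return p.cnt + p.lookedAt;
}

// Every call ends in exactly one of the four outcomes, whatever their order.
bool outcomesConsistent(const AddProfile& p) {
    return p.appended + p.prepended + p.wasIn + p.inserted == p.cnt;
}
```

For the call site on line 1078, with counters (208, 21528, 208, 0, 0, 0, 21528), this gives an average set size of 103.5 and an estimated cost of 21736, which agrees with the figure.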

Based on this data, THERBLIG assigned the implementations to the
variables as shown in Table 4.3, first column. (Only the Set variables are shown.)

The next step is to run THERBLIG a third time. The last run added to the
existing profile data, and did so while executing code that it did not execute during
the first run. The question naturally arises as to whether this would change the
assignments of implementations to variables. And indeed it does, as the second column
of Table 4.3 shows. The changed implementations are marked with an asterisk.

Sorted call sites:

Set (line 761 file main p[88]=(685) profiling costs 21920)
Set__add (line 1078 file main p[176]=(208,21528,208,0,0,0,21528) profiling costs 21736)
Set__add (line 1073 file main p[168]=(213,2407,211,0,2,0,2405) profiling costs 2618)
Set (line 318 file main p[15]=(276) profiling costs 2208)
Set (line 302 file main p[9]=(239) profiling costs 1912)
Set (line 330 file main p[21]=(119) profiling costs 952)
Set__add (line 763 file main p[89]=(685,0,685,0,0,0,0) profiling costs 685)
Set__add (line 827 file main p[103]=(56,519,56,0,0,0,519) profiling costs 575)
Set (line 310 file main p[12]=(51) profiling costs 408)
Set (line 610 file main p[85]=(56) profiling costs 392)
Set__add (line 882 file main p[121]=(114,126,114,0,0,0,126) profiling costs 240)
Set (line 487 file main p[63]=(4) profiling costs 28)
Set__unionl (line 769 file main p[96]=(4,5,4,5,0) profiling costs 20)
Set (line 130 file main p[3]=(1) profiling costs 8)
Set__iterate (line 389 file main p[33]=(8,8) profiling costs 8)
Set__iterinit (line 389 file main p[32]=(4) profiling costs 4)
Set__iterCleanup (line 389 file main p[35]=(4) profiling costs 4)

Figure 4.31: The sorted call sites of Set functions from one Therblig run

Variable        Assignment 1    Assignment 2    Assignment 3

ADTcalls        Set_bmarr       Set_bmarr       Set_bmarr
Ic              Set_bmarr     * Set_slist       Set_slist
adt_afcns       Set_bmarr       Set_bmarr       Set_bmarr
adtafJmpUcns    Set_bmarr     * Set_slist       Set_slist
as_set          Set_slist       Set_slist       Set_slist
callSites       Set_bmarr       Set_bmarr       Set_bmarr
callSitesp      Set_bmarr       Set_bmarr       Set_bmarr
changed         Set_bmarr     * Set_slist       Set_slist
changedp        Set_bmarr     * Set_slist       Set_slist
implSet         Set_bmarr     * Set_slist       Set_slist
ivars           Set_bmarr     * Set_slist       Set_slist
vdJnSigsOf      Set_bmarr       Set_bmarr       Set_bmarr

Table 4.3: Variable assignments based on the profile of three runs of Therblig with
p = 0

The  third  column  of  Table  4.3  shows  the  assignments  when  Therblig  is 
run  for  a  fourth  time.  But  by  now  the  statistics  have  stabilized,  and  the  assignments 
do  not  change. 

All of the runs of THERBLIG above were with p = 0. Table 4.4 shows the
results of running Therblig with p = 1. The asterisk beside the entry for the first
assignment means that the first choice with p = 1 differed from the first choice when
p = 0. The asterisks in later columns mean, as before, that THERBLIG changed the
implementation based on more profile data. Again, the assignments have stabilized
by the third run.

There are only three differences between the selections made when p =
0 and p = 1: the variables callSites, callSitesp, and vdJnSigsOf were formerly
Set_bmarr, a bit-mapped array. Looking at all possible combinations of
assignment resulted in those implementations being changed to Set_slistord, a simple
linked list that keeps its members in the order of their memory addresses.

And finally, Table 4.5 shows the implementations selected when running
with p = .9. The important point to note here is that running with p = .9 means
that only about thirty out of 208 call sites (about 15%) are exhaustively analyzed: the
remaining 170-some-odd call sites are assigned the first consistent implementation
found. The asterisks in the first column indicate that the first assignment differs from
the first assignment in Table 4.4. Asterisks in later columns highlight differences
from the preceding column. It is interesting to note that the assignments had not
stabilized by the third assignment. I did not determine how many iterations p = .9
would have required to stabilize.

Variable        Assignment 1    Assignment 2    Assignment 3

ADTcalls        Set_bmarr       Set_bmarr       Set_bmarr
Ic              Set_bmarr     * Set_slist       Set_slist
adt_afcns       Set_bmarr       Set_bmarr       Set_bmarr
adtafJmpUcns  * Set_bmarr     * Set_slist       Set_slist
as_set          Set_slist       Set_slist       Set_slist
callSites     * Set_slistord    Set_slistord    Set_slistord
callSitesp    * Set_slistord    Set_slistord    Set_slistord
changed         Set_bmarr     * Set_slist       Set_slist
changedp        Set_bmarr     * Set_slist       Set_slist
implSet         Set_bmarr     * Set_slist       Set_slist
ivars           Set_bmarr     * Set_slist       Set_slist
vdJnSigsOf    * Set_slistord    Set_slistord    Set_slistord

Table 4.4: Variable assignments based on the profile of three runs of Therblig with
p = 1


Variable        Assignment 1    Assignment 2    Assignment 3

ADTcalls        Set_bmarr       Set_bmarr       Set_bmarr
Ic              Set_bmarr     * Set_slist       Set_slist
adt_afcns       Set_bmarr       Set_bmarr       Set_bmarr
adtafJmpUcns  * Set_slist       Set_slist       Set_slist
as_set          Set_slist       Set_slist       Set_slist
callSites     * Set_bmarr       Set_bmarr     * Set_slist
callSitesp    * Set_bmarr       Set_bmarr     * Set_slist
changed         Set_bmarr     * Set_slist       Set_slist
changedp        Set_bmarr     * Set_slist       Set_slist
implSet         Set_bmarr     * Set_slist       Set_slist
ivars           Set_bmarr     * Set_slist       Set_slist
vdJnSigsOf    * Set_bmarr       Set_bmarr     * Set_slist

Table 4.5: Variable assignments based on the profile of three runs of Therblig with
p = .9

1    p = 0           34.92s
2    p = .9          32.60s
3    p = 1           35.98s
4    Set_slist       36.21s
5    Set_slistord    36.36s
6    Set_bmarr      152.17s
7    profiling       44.73s

Table 4.6: Therblig running times with various implementation assignments

To measure the effectiveness of the assignments, I examined the output from
Therblig to see if I could have done better with the existing implementations of
sets. In essence, I manually performed Low’s search heuristic: perturbing an existing
assignment of implementations to see if another would be better. For THERBLIG,
there are only three implementations of sets that are feasible:

Set_slist: A simple list, with a single link to successive elements, and a single pointer
to the first element of the list.

Set_slistord:  A  singly-linked  list  as  for  Set_slist,  with  the  addition  of  a  pointer  to 
the  last  element  of  the  list,  and  the  elements  are  kept  on  the  list  in  the  order 
of  their  memory  addresses. 

Set_bmarr:  An  array  of  bits;  requires  functions  to  map  objects  to  integers  and 
integers  to  objects. 

The Set_bmwrd implementation is not feasible since all sets in THERBLIG have more
than 32 elements. I ran seven versions of THERBLIG: one that uses only the profiling
implementations; one with implementations assigned by THERBLIG running with
p = 0 (assignment 3 from Table 4.3); one with implementations assigned with p = 1
(assignment 3 from Table 4.4); one with implementations assigned with p = .9
(assignment 3 from Table 4.5); and three others with all sets assigned Set_bmarr,
Set_slist, and Set_slistord. The timing runs had p = 1 to exercise THERBLIG as fully
as possible. The results are in Table 4.6.

Finally, we look at exactly how much it costs us to profile THERBLIG.
A sense of the cost can be had by comparing the running times of the profiling
implementation vs. the Set_slist implementation in Table 4.6. The slist
implementation of sets was initially derived by removing all profiling code from the profiling
implementation. From this, I estimate that profiling using counters in a special
implementation slows down the program by 10-20%; that is, it runs 10-20% slower than
it would if all the counting code were removed. However, this will often be
insignificant, particularly where the default implementation is ill-suited to
the program being profiled. In that case, the major slowdown of the program will
be due to algorithmic unsuitability, and not to counting.

Chapter 5
Conclusion


For the last twenty years, no one has agreed with Knuth’s Dictum (see
page  1)  enough  to  implement  the  idea,  nor  has  anyone  proven  the  assertion  false. 
There  are  several  assumptions  in  the  Dictum,  two  of  which  have  formed  the  central 
hypotheses  of  this  work.  They  are: 

•  that  profiling  can  be  done  efficiently  enough  so  as  not  to  be  perceived  as  onerous 
by  the  programmer;  and, 

•  that  compilers  and  other  tools  can  automatically  extract  useful  information 
from  profile  data. 

In  the  process  of  investigating  the  first  of  these  hypotheses,  I  determined 
that  an  implicit  assumption  held  by  many  programmers  is  false.  Most  programmers 
(myself  included)  have  believed  that  counting  executions  of  basic  blocks  is  sufficient 
and  more  efficient  than  getting  the  more  complete  information  about  arc  traversals. 
I  demonstrated  in  Chapter  2  that  this  simply  is  not  so.  I  presented  an  algorithm 
MINOPT which finds the ‘optimal’ instrumentation of a program by automatically
placing the instrumentation code in the nodes or on the arcs. Previous algorithms
have found optimal solutions for nodes, or for arcs. MINOPT is the first provably
minimal algorithm for both nodes and arcs. I also pointed out that the ‘optimal’
algorithms aren’t: all have assumed the ability to compute instrumentation
costs in linear time. I do not know whether there exists such an algorithm or not,
but I have shown that if it does exist, it cannot be ‘local’. That is, when estimating
the instrumentation costs of a node’s incoming and outgoing arcs, more information
is required than just the execution frequencies of that node and its arcs.

My measurements showed that profiling in the form of in-line execution
counts imposes anywhere from 10% to 20% overhead. This can be predicted solely
from the observation that most programs’ basic blocks average from four to ten
instructions in size, and from the not too unrealistic assumption that incrementing
a counter in memory requires about the average number of cycles for the execution
of an instruction on a machine. Therefore, putting instrumentation in the most
frequently executed basic block will produce a slowdown of 10-20% for the program
as a whole.

Programmers would complain if a program were slowed down 10-20% for
no reason. That is, if no one (or thing) was making any use of the profile data,
then programmers would turn off profile collection. (That is why all compilers today
require the programmer to specify when to collect profile data.) However, the second
of the hypotheses above would alleviate the problem considerably. If some part of
the programming system were able to utilize the profile data to produce superior
programs, then the profile collecting overhead would not be onerous. It is comparable to
the overhead of non-optimized bounds-checking code. While there has been some
research in improving the overhead of profile collection (in particular, see Sarkar’s
paper on using dependency graphs to optimize profile counting [42]), there has yet
to be a definitive exploration of the optimization of profile counting.

For there to be such research, it has to be shown that continual collection
of profile data is a win. Therefore, I concentrated in Chapters 3 and 4 on
exploring ways a compiler might make use of profile data. In Chapter 3 I presented an
algorithm  I  call  Greedy  Sewing  for  improving  the  behavior  of  programs  on  machines 
with  instruction  caches.  By  physically  moving  basic  blocks  closer  together  that  are 
executed  close  together  in  time,  miss  rates  in  instruction  caches  can  be  reduced  up  to 
50%.  Profile  data  not  only  allows  the  compiler  to  know  which  basic  blocks  to  move 
closer  together,  it  also  allows  it  to  ignore  those  situations  where  it  will  not  matter 
to  the  final  performance  of  the  program. 

The primary contribution of this work is the development of a
programming system that utilizes profile data to select implementations of program
abstractions. The Typesetter system integrates the development, evaluation, and
selection of alternative implementations of programming abstractions into a package
that is transparent to the User. Unlike previous systems, TYPESETTER does not
require specialized compiler knowledge of the User or the Implementor. From the
data collected so far, the TYPESETTER approach to system synthesis appears to be
a promising avenue of research.

5.1  Problems  and  future  work 

I  have  only  scratched  the  surface  of  the  body  of  engineering  problems  that 
need  to  be  solved  before  TYPESETTER  can  be  considered  a  complete  system.  Some 
of  these  are  related  to  problems  inherent  in  using  profile  data  to  predict  the  future 
performance  of  a  program,  but  others  are  related  to  the  specific  approach  taken  by 
Typesetter. 

Execution counts: During this work, I fell into an assumption that I think is
widely shared, but which can cause problems. I had assumed that summing profile
counts across multiple runs of a program was a reasonable approach to understanding
the behavior of a program. But consider a program that has a function that is called
once for each element on a list. For 99% of the elements, the function requires O(1)
time to execute. But for 1 out of 100 elements, it requires much more time. For
example, let us assume that the occurrence of a certain kind of element requires that
it be put in a separate list, and that sorting this n-element list requires O(n^2)
time (it uses an inefficient sorting algorithm) where n = 1. If the program is run
M times, and the profile counts used as measures of the complexity of this function
are summed, then there comes a point where the one-in-a-hundred event dominates
the analysis. If we assume a list that is 100 elements long, and one of the elements
causes a re-sort, then running the program 100 times could make the list look like it
was 10,000 elements long, with 100 re-sorts, implying that the sorting of the special
elements requires as much time as the processing of the non-special elements, when
in fact it never sorts a list longer than one element.

In general, this problem will rear its head when evaluation functions are
non-linear in the values of the profile variable. For profilers like prof and gprof, this
may not cause any particular problem, even though their output does not indicate
how many runs of the program produced the data on which they base their analysis.
Therblig was modified to count the number of executions of a program in addition
to the counters specified in the profiling implementations. During analysis, all
counters were divided by the number of program runs to try to avoid problems similar to
the ones described in the previous paragraph. However, I am not satisfied that this
avoids all problems of analysis from execution counts derived from multiple runs.
This needs to be examined further.
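
The distortion, and the effect of dividing by the number of runs, can be shown with a quadratic evaluation function. The sketch below is illustrative only: sortCost stands in for an Implementor-written evaluation function, and all names are hypothetical rather than taken from Therblig's source.

```cpp
// Hypothetical evaluation function for an O(n^2) sort of n elements.
double sortCost(double n) {
    return n * n;
}

// Evaluate on the raw counter, summed over every profiled run.
double costFromSummedCount(double summedN) {
    return sortCost(summedN);
}

// Evaluate on the per-run average: divide the summed counter by the
// number of runs first, as described for the modified analysis.
double costPerRun(double summedN, double runs) {
    return sortCost(summedN / runs);
}

// What actually happened: each run sorted its own small list.
double actualTotalCost(double nPerRun, double runs) {
    return runs * sortCost(nPerRun);
}
```

With 100 runs that each sort a one-element list, the summed counter is 100. Evaluating the quadratic cost on the summed counter gives 10000, the same magnitude as processing all 10,000 ordinary elements at unit cost, while the true total is 100 and the per-run estimate is 1.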


Evaluation functions: The most difficult functions to write in TYPESETTER are
the evaluation functions. While some of the difficulty is due to the fact that I’ve never
had to write functions that evaluate the potential performance of other functions in
such numbers before, they bring their own set of problems. For one thing, they are
hardly ever ‘wrong’, at least not in the sense that inaccuracies produce obviously
aberrant behavior on the part of the program. I have serendipitously discovered
several instances where evaluation expressions I have written do not accurately reflect
the performance of the actual function; even ignoring the fact that these are all
estimates anyway, the results returned were misleading. Debugging these routines
to a reasonable level of accuracy is difficult.

Kenny and Lin [27] report a technique for capturing the behavior of
functions that might be usable in a THERBLIG-like environment. The Implementor would
specify an expression with free variables that he suspects would adequately capture
the behavior of the function in question; for example, A*x + B*y + C, where x
and y are parameters such as the length of a list, or size of set. By executing the
function many times on many inputs, an average behavior for the function based
on x and y can be found by determining appropriate values for A, B and C with
a curve fitting algorithm. While this may be an approach for rigorously and more
automatically producing evaluation functions, it will not reduce the amount of work
required by the Implementor and may impede the Implementor from taking
advantage of logical information contained in the profile data. For example, a curve-fitting
approach may not be able to handle knowledge about the density of bit vectors,
order of presentation of elements to a function, etc.
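
To make the curve-fitting idea concrete, the following is a minimal sketch of an ordinary least-squares fit of cost = A*x + B*y + C to measured samples. It is not Kenny and Lin's technique as published: the linear model, the Sample and Fit types, the function names, the singularity threshold, and the choice of normal equations solved by Cramer's rule are all assumptions made to keep the example short.

```cpp
#include <array>
#include <cmath>
#include <vector>

struct Sample { double x, y, cost; };  // one measured call (hypothetical)
struct Fit { double A, B, C; };        // fitted coefficients

using Mat3 = std::array<std::array<double, 3>, 3>;

// Determinant of a 3x3 matrix by cofactor expansion along the first row.
static double det3(const Mat3& m) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

// Least-squares fit of cost = A*x + B*y + C.
// Returns false when the samples do not determine a unique fit.
bool fitLinearCost(const std::vector<Sample>& samples, Fit& out) {
    Mat3 n{};                    // normal matrix, zero-initialised
    std::array<double, 3> rhs{}; // right-hand side, zero-initialised
    for (const Sample& s : samples) {
        const double row[3] = {s.x, s.y, 1.0};
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) n[i][j] += row[i] * row[j];
            rhs[i] += row[i] * s.cost;
        }
    }
    const double d = det3(n);
    if (std::fabs(d) < 1e-12) return false;  // assumed singularity threshold
    double w[3];
    for (int k = 0; k < 3; ++k) {            // Cramer's rule, column by column
        Mat3 m = n;
        for (int i = 0; i < 3; ++i) m[i][k] = rhs[i];
        w[k] = det3(m) / d;
    }
    out = Fit{w[0], w[1], w[2]};
    return true;
}
```

A fit of this kind sees only the numeric parameters it is given, which is the limitation noted above: facts such as bit-vector density or the order in which elements arrive have no place in the model unless the Implementor encodes them as extra variables.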

In  general,  evaluation  functions  need  to  be  easier  to  write  and  debug. 

Evaluating ADT-invoked User functions: There are several optionals that
require the names of User-defined functions. The ones I have identified are ObjToInt,
IntToObj, and compareFcn. They present problems when used because
TYPESETTER has no way of estimating the runtime resources of the indicated functions.
Presumably, future systems will have the User give some indication of the cost of
executing these functions so that the evaluation functions can give better estimates
of the cost of using implementations that require them. I would like to avoid forcing
the User to write evaluation functions: that is mixing the roles of User and
Implementor too much. Exactly how to achieve the same result without User-written
evaluation routines is yet to be determined.

The prototype finesses the problem entirely. Currently, the ObjToInt
function must always be a reference to an integer field of the object, and IntToObj must
be an array reference. This has not been terribly restrictive up to this point, but since
the maintenance of the array of objects must be done by the User, it imposes some
overhead that should be eliminated. Ideally, a map from integers to objects, and its
inverse, should not be in the final implementation of a program unless it is needed.
Currently, it will always be there, whether Therblig selects implementations that
use them or not.
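
The restriction can be illustrated with a small bit-array set. This is a sketch, not the Set_bmarr implementation from the library: the Node type, its id field (playing the role of ObjToInt), the User-maintained table (playing the role of IntToObj), and the class name BitArraySet are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// ObjToInt is a plain integer field; ids are assumed to be 0..N-1.
struct Node { int id; };

class BitArraySet {
public:
    // `table` is the User-maintained array: table[i] must point to the
    // object whose id is i, and it must outlive this set.
    explicit BitArraySet(const std::vector<Node*>& table)
        : table_(table), bits_(table.size(), false) {}

    // Returns true if the element was newly added.
    bool add(const Node* n) {
        const std::size_t i = static_cast<std::size_t>(n->id);  // ObjToInt
        if (bits_[i]) return false;
        bits_[i] = true;
        return true;
    }

    bool contains(const Node* n) const {
        return bits_[static_cast<std::size_t>(n->id)];
    }

    // Iteration tests every bit, set or not, then maps back to objects.
    std::vector<Node*> members() const {
        std::vector<Node*> out;
        for (std::size_t i = 0; i < bits_.size(); ++i)
            if (bits_[i]) out.push_back(table_[i]);             // IntToObj
        return out;
    }

private:
    const std::vector<Node*>& table_;
    std::vector<bool> bits_;
};
```

The table has to be kept in step with object creation by the User's own code, which is the maintenance overhead described above; it is needed even if no bit-array implementation is ever selected.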

Second-order  effects:  Another  problem  arises  when  there  are  dependencies  in 
the  User  program  that  are  not  part  of  the  information  available  to  a  THERBLIG-like 
analyzer.  Consider  a  program  that  keeps  objects  sorted  on  a  list,  but  has  its  own 
sorted-list  code  rather  than  using  a  library  routine.  The  list  is  created  from  a  set  of 
these  objects,  the  implementation  of  said  set  assigned  by  the  system.  It  could  turn 
out  that  the  implementation  of  the  set  causes  the  elements  to  be  returned  in  an  order 
that interferes with the efficient execution of the User’s code: for example, one implementation
of  set  returns  the  elements  in  the  order  of  their  memory  address  which  corresponds 
to  the  order  in  which  they  were  constructed  which  in  turn  corresponds  to  the  order 
data  was  read  from  a  file.  It  is  easy  to  see  that  there  could  be  interference  between 
the  User’s  implementation  and  any  implementation  chosen  by  the  system  for  the  set, 
and  no  amount  of  analysis  of  the  User’s  use  of  the  set  would  uncover  it. 

This is outside the scope of a Therblig-style system. One of its major
premises is that looking at the use of the ADTs alone is sufficient to make a
reasonable assignment, and extra-ADT information is simply not made available to it.
I have not encountered this kind of second-order effect in any of the programs I have
run through Therblig, but theoretically it is possible.

Implementation  containment:  When  a  bit  vector  implementation  of  a  set  of  size 
N  is  instantiated,  then  any  declaration  of  a  smaller  set  could  share  the  code  for  the 
larger  set.  This  would  decrease  the  program’s  memory  size  further,  at  the  expense  of 
making  the  space  allocated  for  some  sets  larger.  Discovering  and  taking  advantage 
of  these  tradeoffs  would  require  the  evaluation  functions  to  consider  space  as  well  as 
time  in  their  analysis.  Since  Low,  for  one,  has  already  considered  the  more  complex 
space-and-time  integral  objective  function  for  minimization,  I  felt  that  duplicating 
this  was  not  necessary  to  my  objectives  and  I  have  concentrated  on  the  simpler 
time-analysis. 

Even  if  Therblig  were  capable  of  handling  the  space  analysis,  there  is 
nothing  in  its  analysis  framework  that  would  allow  the  kinds  of  implementation 
containment  described  above.  In  other  words,  there  is  no  way  for  the  evaluation 
functions  written  by  the  Implementor  to  conclude  “Use  implementation  X  unless 
condition  Y  holds,  in  which  case  use  implementation  Z.”  Again,  future  work  will 
have  to  show,  first,  that  this  is  an  optimization  that  needs  to  be  available  and, 
second,  how  to  obtain  it. 

Design of implementation libraries: I have barely begun to explore the
possibilities in a library of implementations. As mentioned before, it may be desirable
to  have  several  profiling  implementations,  each  capable  of  collecting  certain  kinds 
of  information  that  is  otherwise  difficult  to  obtain.  For  example,  once  a  bit  vector 
implementation  of  a  set  is  determined  to  be  desirable,  another  bit-vector  oriented 
profiling  implementation  could  be  used  to  determine  which  of  the  many  bit  vector 
implementations  would  be  best  for  this  program. 

In the interests of simplicity, I have also avoided making use of the more
complex language features available in my base language, C++. For instance, the
implementations List_slist, Set_slist, and Map_slist all use the same
implementation of a linked list as their underlying representation. Currently,
they each have their own copies of this code, primarily because the kinds of
profiling information collected differ between the implementations. It is
possible that they could all be derived from a linked-list class, increasing even
further the possibilities for code sharing. Future work is needed to look at
integrating the class hierarchy and attendant inheritance into the library of
implementations.
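As an illustration of the code sharing suggested above, the following is a minimal sketch in modern C++ rather than in the macro dialect used by the library. The names SListBase, ListSlist, SetSlist, noteCellVisited, and cellsVisited are hypothetical and do not appear in the system; the sketch assumes that the profiling differences between implementations can be expressed as a virtual hook that each derived class overrides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shared base class: one singly linked list of ints,
// with a hook that derived classes may override to collect profile data.
class SListBase {
public:
    virtual ~SListBase() {
        while (head_) { Cell* n = head_->next; delete head_; head_ = n; }
    }
    std::size_t size() const { return size_; }

protected:
    struct Cell { int value; Cell* next; };

    void pushFront(int v) { head_ = new Cell{v, head_}; ++size_; }

    // Linear search; reports every cell it touches to the profiling hook.
    bool contains(int v) {
        for (Cell* c = head_; c; c = c->next) {
            noteCellVisited();
            if (c->value == v) return true;
        }
        return false;
    }

    // Default: collect nothing.
    virtual void noteCellVisited() {}

private:
    Cell* head_ = nullptr;
    std::size_t size_ = 0;
};

// A list abstraction that needs no search profiling.
class ListSlist : public SListBase {
public:
    void prepend(int v) { pushFront(v); }
};

// A set abstraction that counts the cells visited by membership tests.
class SetSlist : public SListBase {
public:
    void add(int v) { if (!contains(v)) pushFront(v); }
    bool member(int v) { return contains(v); }
    long cellsVisited = 0;

protected:
    void noteCellVisited() override { ++cellsVisited; }
};
```

Both abstractions reuse the same list code; only the set pays for, and records, the search profile. Whether a virtual hook is cheap enough for a profiling implementation is a separate question that this sketch does not address.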


5.2  Summary 

I have explored in some detail the proposition that compilers and language
systems can make use of profile data in the generation of code for programs, and
in the synthesis of large software systems. I have improved the existing
‘optimal’ instrumentation algorithms, and shown how arc counts can be used to
improve the execution time of programs on machines with instruction caches. I
have presented the design of a language and attendant system that can select for
a User the implementations of variables declared to be of an abstract data type.
I have also demonstrated that such a system can make reasonable choices for those
implementations based on the profile data collected by abstraction-specific
profiling implementations.
Appendix A

The Knuth-Stevenson Algorithm


//
// A program that implements the MINOPT algorithm.  MINOPT finds the minimal
// set of arcs and nodes required for implementation in a program that allows
// execution counts of all arcs and nodes to be computed.
// Syntax:
//     ks < graphDescriptionFile
//
// This program also implements Knuth and Stevenson's (K-S) algorithm for
// finding a minimal set of nodes to instrument in a program.
// Syntax:
//     ks -o < graphDescriptionFile
//
// The differences between the two algorithms are controlled by the boolean
// flag MINOPTing; these areas are highlighted with a right comment:
//                                                          // MINOPT
//

#include <stream.h>
#include <stdio.h>
#include <assert.h>
#include "util.H"
#include "Tokens.H"
#include "userTypes.H"
#include "ks_ADTs.H"
#include "IMPLSRCS.H"

boolean MINOPTing = true;                                   // MINOPT

void
error(char *msg)
{
    cerr << "Error: " << msg << "\n";
}

void
fatal(char *msg)
{
    cerr << "Fatal error: " << msg << "\n";
    abort();
}

Define(`maxNofGraphObjs', 1000)@;
Define(`maxNofGraphObjsm1', `Eval(maxNofGraphObjs - 1)')@;
Define(`go_bminfo', `lowerb=0, upperb=maxNofGraphObjsm1,
        IntToObj=GraphObjNo, ObjToInt=NofGraphObj')@;

//
// class definitions
//

CLASS(Registration_obj)
{
    friend main();
    Any* reg;
    int size;
    int last;
public:
    Registration_obj(int maxsz) { last = -1; size = maxsz;
                                  reg = new Any[size]; }
    Any& operator[](int i) {
        if (i < 0 || i >= size) fatal("illegal registration number");
        return reg[i];
    }
    int next(Any obj, char* str);
}SSALC

CLASS(Flowitem_obj)
{
public:
    Flowitem_obj(GraphObj o, boolean p) { obj = o; plus = p; }
    GraphObj obj;
    boolean plus;           // false if value is to be subtracted
}SSALC


CLASS(GraphObj_obj)
{
    friend class Registration_obj;
    friend int NofGraphObj(GraphObj go);
    friend GraphObj GraphObjNo(int i);
    boolean IsArc;          // false if node
    int rnum;
public:
    static Registration_obj go_R;
    // data
    int freq;
    double cost;            // the cost of measuring this object; defaults to
    Token name;
    boolean instrument;     // true if obj is to be instrumented
    DECLARE(flowlist, List, Flowitem);
    // structors
    GraphObj_obj(Token t, boolean b);
    // functions
    boolean isArc() { return IsArc; }
    boolean isNode() { return !IsArc; }
    int sum();              // add up the flowitems for this object
    //
    // data and functions specific to the Knuth-Stevenson algorithm
    // I have maintained the variable names from the Knuth and
    // Stevenson paper in BIT 13 (1973), pp 323-337.
    void arcto(GraphObj);   // to make a node for each node and arc
                            // in the original graph
    GraphObj equivTo;       // A graph object that belongs to the
                            // same equivalence class as THIS object.
                            // == nil if THIS object is the
                            // representative of its equivalence class.
    //
    GraphObj super();       // Returns the unique graph object
                            // that is the representative object
                            // of the equivalence class to which
                            // THIS graph object belongs (a
                            // function that chases the
                            // equivTo chain).
    //
    GraphObj follow;        // creates an super-arc in the reduction graph
                            // from THIS object to the FOLLOW object.
    //
    boolean d;
    GraphObj u;
    GraphObj compfather;
    // These three attributes are used only for vertices in the reduced
    // graph.  We build the spanning tree of connected components in the
    // reduced graph as we go.  There is one arc in the reduced graph
    // for each vertex and arc in the original graph.  If COMPFATHER ==
    // nil then this vertex represents an (super) arc in the reduced
    // graph, viz an arc from u.super() to u.follow().super().  If d,
    // then that arc goes from THIS to COMPFATHER, else the arc goes
    // from COMPFATHER to THIS.
    //
    GraphObj comp();
    // The representative of this supervertex's component in the reduced
    // program flow graph constructed so far; analogous to super(), above.
    void makeComponent();
    void createRedArc();
}SSALC

CLASS(Node_obj) : public GraphObj_obj
{
public:
    // data
    DECLARE(gozoutas, Set, Arc, go_bminfo);
    DECLARE(gozintas, Set, Arc, go_bminfo);
    // functions
    void gozinta(Arc);
    void gozouta(Arc);
    int sumGozintas();
    int sumGozoutas();
    // structors
    Node_obj(Token t);
}SSALC

CLASS(Arc_obj) : public GraphObj_obj
{
public:
    // data
    Node from;
    Node to;
    // structors
    Arc_obj(Token t);
    // functions
    void goes(Node F, Node T)
    {
        (from=F)->gozouta(this);
        (to=T)->gozinta(this);
    }
}SSALC

@@ ================ Registration ================

Registration_obj::next(Any obj, char* str)
{
    if (last == size-1) {
        int newsize = (9*size)/8;
        Any* newreg = new Any[newsize];
        int i;
        for (i = 0; i < size; i++) newreg[i] = reg[i];
        delete [size]reg;
        size = newsize;
        reg = newreg;
        cerr << "Warning: registration for " << str << " increased to " << size
             << "\n";
    }
    reg[++last] = obj;
    return last;
}

// ================ Graph objects ================

Registration_obj GraphObj_obj::go_R(maxNofGraphObjs);
int NofGraphObj(GraphObj go)
{
    return (((GraphObj)go)->rnum);
}
GraphObj GraphObjNo(int i)
{
    return (GraphObj_obj::go_R[i]);
}

CONSTRUCTOR(GraphObj_obj, Token t, boolean b)
{
    name = t;
    IsArc = b;
    freq = -99999999;
    cost = 1;
    instrument = false;
    equivTo = nil;
    follow = nil;
    u = nil;
    compfather = nil;
    rnum = go_R.next(this, "GraphObj");
}

void
GraphObj_obj::arcto(GraphObj tgt)
{
    if (follow == nil) {
        follow = tgt;
    }
    else {
        // make tgt and follow equivalent
        tgt = tgt->super();
        if (tgt != follow->super()) {
            tgt->equivTo = follow->super();
        }
    }
}

GraphObj
GraphObj_obj::super()
{
    return (equivTo == nil? this : equivTo->super());
}

GraphObj
GraphObj_obj::comp()
{
    return (compfather == nil? this : compfather->comp());
}

void
GraphObj_obj::makeComponent()
{
    // transform the d, u, and compfather attributes of the
    // supervertices so that this supervertex is the representative of
    // its reduction graph spanning tree component.
    if (compfather != nil) {
        compfather->makeComponent();
        compfather->compfather = this;
        compfather->d = !d;
        compfather->u = u;
        compfather = nil;
    }
}

void
GraphObj_obj::createRedArc()
{
    // create the arc in the reduction graph that will correspond to
    // THIS graph object
    GraphObj v, w;
    v = super();
    v->makeComponent();
    w = follow->super();
    // if v and w not in the same tree of the forest of intermediate
    // spanning trees, ...
    if (v != w->comp()) {
        // put in spanning tree
        v->compfather = w;
        v->d = true;
        v->u = this;
        instrument = false;
    }
    else {
        // update the flow items
        instrument = true;
        while( w != v ) {
            Flowitem fi = new Flowitem_obj(this, w->d);
            List_appendl(w->u->flowlist, fi);
            w = w->compfather;
        }
    }
}

// ================ Nodes ================

CONSTRUCTOR(Node_obj, Token t) CC GraphObj_obj(t, false)
{
}

void
Node_obj::gozinta(Arc a)
{
    Set_add(gozintas, a);
}

void
Node_obj::gozouta(Arc a)
{
    Set_add(gozoutas, a);
}

int
Node_obj::sumGozintas()
{
    Arc a;
    int sum = 0;
    forAll(a, gozintas,
        sum += a->freq;
    );
    return sum;
}

int
Node_obj::sumGozoutas()
{
    Arc a;
    int sum = 0;
    forAll(a, gozoutas,
        sum += a->freq;
    );
    return sum;
}

// ================ Arcs ================

CONSTRUCTOR(Arc_obj, Token t) CC GraphObj_obj(t, true)
{
    from = nil;
    to = nil;
}

//
// ================ the program ================
//

DECLARE(Graph, Set, GraphObj, go_bminfo);
DECLARE(dict, Map, Token, GraphObj);

Node
readNode()
{
    Node n;
    Token t;
    cin >> t;
    if (Map_in(dict,t)) {
        Map_value(dict, t, n);
    }
    else {
        n = new Node_obj(t);
        Map_define(dict, t, n);
    }
    return n;
}

boolean
readArc()
{
    // input is a set of lines of the form:
    //     arcname freq cost nodename nodename ,
    //
    Token t;
    double k;
    cin >> t;
    if (cin.eof()) return false;
    assert(!Map_in(dict,t));
    Arc a = new Arc_obj(t);
    cin >> a->freq;
    cin >> a->cost;
    a->cost *= a->freq;
    Map_define(dict, t, a);
    // now read the from and to nodes
    Node F = readNode();
    Node T = readNode();
    Set_add(Graph,a);
    Set_add(Graph,F);
    Set_add(Graph,T);
    a->goes(F,T);
    if (MINOPTing) {
        // now do the equivalent of K-S arcto function, except we do it for
        // the nodes and the arcs
        F->arcto(a);                                        // MINOPT
        a->arcto(T);
    }
    else {
        F->arcto(T);
    }
    // done
    cin >> t;
    if (t == commaToken) return true;
    return false;
}

void
readGraph()
{
    while (readArc());
}


void
propagateCounts()
{
    GraphObj o;
    forAll(o, Graph,
        if (o->isArc()) {
            assert(o->freq >= 0);
        }
        else {
            Node_obj& n = *((Node)o);
            int t = n.sumGozintas();
            assert(n.freq < 0 || n.freq == t);
            n.freq = t;
            n.cost = (double)n.freq;
            assert(n.freq == n.sumGozoutas());
        }
    );
}

int
gobjCmp(GraphObj f1, GraphObj f2)
{
    if (f1->cost > f2->cost) return -1;
    if (f1->cost < f2->cost) return 1;
    return 0;
}


main(int argc, char **argv)
{
    GraphObj go;
    double sum = 0;
    DECLARE(sortedObjList, List, GraphObj);
    if (argc > 1) {
        if (strcmp(argv[1], "-o") == 0) MINOPTing = false;
        else {
            fatal("Unrecognized option");
        }
    }
    readGraph();
    propagateCounts();
    Set_sort2(Graph, sortedObjList, gobjCmp);
    forAll(go, sortedObjList,
        if (MINOPTing || go->isNode()) {                    // MINOPT
            go->createRedArc();
        }
    );
    forAll(go, sortedObjList,
        if (!MINOPTing && go->isArc()) continue;            // MINOPT
        cout << go->name;
        if (go->instrument) {
            cout << " instrument cost=" << go->cost << "\n";
            sum += go->cost;
        }
        else {
            // print equation that computes count for this object
            cout << "=";
            Flowitem item;
            forAll(item, go->flowlist,
                if (item->plus) cout << "+";
                else cout << "-";
                cout << item->obj->name;
            );
            cout << "\n";
        }
    );
    cout << "Instrumentation cost = " << sum << "\n";
}
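The selection step of the listing (sort the graph objects by decreasing cost, place each one in the spanning tree of the reduction graph if it joins two components, and otherwise instrument it) can be sketched independently of the ADT macros. The following is a simplified illustration in modern C++, not the program above: the names RedArc, findRoot, and chooseInstrumentation are hypothetical, the reduction graph is assumed to be given directly as a list of arcs between numbered supervertices, and the flow equations that recover the uninstrumented counts are not built.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// One arc of the (already reduced) graph; cost is the price of counting it.
struct RedArc {
    int from;
    int to;
    double cost;
    bool instrument;
};

// Union-find lookup with path halving; plays the role that comp() plays
// in the listing (finding the representative of a component).
static int findRoot(std::vector<int>& parent, int v) {
    while (parent[v] != v) {
        parent[v] = parent[parent[v]];
        v = parent[v];
    }
    return v;
}

// Marks the arcs that must carry counters and returns their total cost.
// Arcs are visited from most to least expensive, so expensive arcs are
// absorbed into the spanning tree and only cheap arcs are instrumented.
double chooseInstrumentation(int nVertices, std::vector<RedArc>& arcs) {
    std::vector<int> parent(nVertices);
    std::iota(parent.begin(), parent.end(), 0);

    std::sort(arcs.begin(), arcs.end(),
              [](const RedArc& a, const RedArc& b) { return a.cost > b.cost; });

    double total = 0;
    for (RedArc& a : arcs) {
        int rf = findRoot(parent, a.from);
        int rt = findRoot(parent, a.to);
        if (rf != rt) {
            parent[rf] = rt;        // joins two components: tree arc
            a.instrument = false;
        } else {
            a.instrument = true;    // would close a cycle: needs a counter
            total += a.cost;
        }
    }
    return total;
}
```

For a triangle whose arcs cost 10, 5, and 1, the two expensive arcs enter the tree and only the arc of cost 1 is instrumented. The sketch uses a standard union-find in place of the compfather chains, which also carry the direction flag d needed to build the flow equations.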


Appendix  B 


Therblig 


@ FILE: types.t
@ terminology:
@ The class that is a specific implementation of an ADT is called
@ a "representation class" of that ADT, since an ADT is not just a
@ set of functions but also a representation of the data.
@
@ The ADT is defined as a set of functions operating on objects of that
@ type.  A representation class will have zero, one, or more
@ "implementations" of those interface functions.
@
#include <stream.h>
#include <stdio.h>
#include "userTypes.H"
#include "Tokens.H"
#include "IMPLSRCS.H"
#include "option_types.H"

Define(`maxNofVarDecls', 1000)@;
Define(`maxNofVarDeclsm1', `Eval(maxNofVarDecls - 1)')@;
Define(`vd_bminfo', `lowerb=0, upperb=maxNofVarDeclsm1,
        IntToObj=VarDeclNo, ObjToInt=NofVarDecl')@;

Define(`maxNofADTabsFcn', 200)@;
Define(`maxNofADTabsFcnm1', `Eval(maxNofADTabsFcn-1)')@;
Define(`adtaf_bminfo', `lowerb=0, upperb=maxNofADTabsFcnm1,
        IntToObj=ADTabsFcnNo, ObjToInt=NofADTabsFcn')@;

Define(`maxNofADTimpFcn', 200)@;
Define(`maxNofADTimpFcnm1', `Eval(maxNofADTimpFcn-1)')@;
Define(`afd_bminfo', `lowerb=0, upperb=maxNofADTimpFcnm1,
        IntToObj=ADTimpFcnNo, ObjToInt=NofADTimpFcn')@;

Define(`maxNofADTcallSite', 225)@;
Define(`maxNofADTcallSitem1', `Eval(maxNofADTcallSite-1)')@;
Define(`acs_bminfo', `lowerb=0, upperb=maxNofADTcallSitem1,
        IntToObj=ADTcallSiteNo, ObjToInt=NofADTcallSite')@;

CLASS(Registration_obj)
{
    friend main();
    Any* reg;
    int size;
    int last;
public:
    Registration_obj(int maxsz) { last = -1; size = maxsz;
                                  reg = new Any[size]; }
    Any& operator[](int i) {
        if (i < 0 || i >= size) fatal("illegal registration number");
        return reg[i];
    }
    int next(Any obj, char* str);
}SSALC


CLASS(Profarray_obj)
{
    Token filename;         //ON the filename for this array
    long size;              //ON the size of this array
    long *iarray;           //ON the array itself
    double *array;          //ON avg'd over runs
    boolean Valid;          //ON did the read work?
public:
    Profarray_obj(FILE *pf, Token fn, int sz);
    ~Profarray_obj() { delete [size+1] array; }
    double& operator[](int i) { return array[i]; }
    boolean valid() { return Valid; }
    Token file() { return filename; }
}SSALC

#define maxNofOptsPerADT 10

CLASS(Optional_obj)
{   @ optionals as they are read in from the input file
    @ this class is actually overloaded.  When the ADT is being defined,
    @ a list of these is created for each optional that is possible on a
    @ variable declaration.  The ival is then the index of that optional for
    @ that ADT.
    @ When the optionals are read on an actual variable declaration,
    @ then the fields are filled in as documented below.
public:
    int idx;                @ the index of the optional
    optionType *table;      @ the table it is an index into
    Token sval;             @ the value represented as a token
    int ival;               @ the value as an integer, if ivalid
    bool ivalid;            @ is the value a valid integer?
    @ fcns
    Optional_obj(Token val, optType intp, int idx, optionType *tbl);
    Optional_obj(int value);
    Optional_obj();
    ostream& print(ostream&);
    ostream& printForm(ostream&);
}SSALC


CLASS(VarDecl_obj)
{
    void init_VarDecl(void);
public:
    static Registration_obj vd_R;
    Token name;             @ my name;
    ADType vd_ADT;          @ my abstract type;
    boolean implemented;    @ means that this variable has been
                            @ assigned a representation (i.e.
                            @ vd_repr contains a valid impln
    ADTRepr vd_repr;        @ my representation type; used by
    ADTRepr vd_bestRepr;
    DECLARE(vd_adtParms, List, Token);  @@ parms in my DECLARE
                            @ (including optionals?)
    DECLARE(vd_inSigsOf, Set, ADTcallSite, acs_bminfo);
    AliasSet vd_as;
    int rnum;

    @@ OPTIONALS
    @@
    boolean optsParsed;
    Optional vd_opts[maxNofOptsPerADT];     // not nil if present
    String instance_name, instance_parm;
    String coercion_name, coercion_parm;
    String constructor_parms;
    @ fcns
    VarDecl_obj(Token, ADType, ADTRepr);    @ #1
    VarDecl_obj(Token, ADType, Token);      @ #1
    VarDecl_obj(Token, ADType);             @ #2
    VarDecl_obj(Token, Token);  @ check that the 2nd is an ADType name
    void aliasOf(VarDecl);
    void betterRepr(){assert(implemented); vd_bestRepr = vd_repr; }
    @ Note that the difference between #1 and #2 is that #1 assigns a
    @ representation, while #2 does not.
    boolean operator==(VarDecl_obj &that);
    boolean operator!=(VarDecl_obj &that) { return !(*this == that); }
    ostream& print(ostream&);
    ostream& printForm(ostream&);
}SSALC

CLASS(Signature_obj)
{
public:
    DECLARE(sig_sig, List, VarDecl);
    @ fcns
    Signature_obj();
    void add(VarDecl V);
    void add(Token, Token, Token);      @ create the VarDecl yourself
    void add(VarDecl, Token, Token);
    void add(Token, ADType, Token);     @ ditto
    void add(Token, ADType, ADTRepr);
    void add_dontCare();
    int len(void);
    ostream& print(ostream&);
    ostream& printForm(ostream&);
}SSALC

CLASS(ADType_obj)
{   @ An AbstractDataType; defined by a set of AbstractFunctionDefinitions.
public:
    bool adt_inited;
    int adt_number;
    Token name;
    ADTRepr adt_profileImpl;            @ my profiling representation
    DECLARE(adt_reprs, Map, Token, ADTRepr);    @ my representations
    DECLARE(adt_afcns, Set, ADTabsFcn, adtaf_bminfo);
                                        @ abstract fcns defining my interface
    @@DECLARE(adt_optionals, Map, Token, Optional);  @
    @ fcns
    ADType_obj(Token);
    ADType_obj(char *);
    ostream& print(ostream&);
    ostream& dump(ostream&);
    boolean operator==(ADType_obj r) { return (name == r.name); }
    boolean operator!=(ADType_obj r) { return (name != r.name); }
}SSALC

CLASS(ADTRepr_obj)
{   @ the representation of an AbstractDataType;
    @ an ADT implementation consists of a set of sets of implementations of the
    @ ADTs functions.
    @ Note: name == adtr_of->name + "_" + adtr_suffix
public:
    bool adtr_inited;
    Token name;             @ My name (Set_1, List_3, Map_2, etc
    ADType adtr_of;         @ the ADT I'm a representation of
    Token adtr_suffix;
    int adtr_number;        @ for accessing tables
    @ fcns
    ADTRepr_obj(Token);
    ADTRepr_obj(Token, ADType);
    ostream& print(ostream&);
    void printName(ostream&);
    boolean operator==(ADTRepr_obj r) { return (name == r.name); }
}SSALC

CLASS(ADTabsFcn_obj)
{   @ abstract function (interface function)
public:
    static Registration_obj adtaf_R;
    int adtaf_uid;          @ for registration
    Token name;             @ the name by which the user invokes me
    ADType adtaf_for;       @ the ADT I'm in the interface of
    Signature_obj adtaf_sig;    @ my parameters
    ADTimpFcn evalFcn;      @ the function representing the profiling
                            @ implementation evaluation function.
    DECLARE(adtaf_impl_fcns, Set, ADTimpFcn, afd_bminfo);  @ my implementations
    int nofProfVars;        @ number of profiling variables for this fcn
    @ functions
    ADTabsFcn_obj(Token, ADType);
    ostream& print(ostream&);
    void printName(ostream&);
    boolean operator==(ADTabsFcn_obj r) { return (name == r.name); }
}SSALC(adtaf_sig())

CLASS(ADTimpFcn_obj)
{   @ An implementation of an abstract (interface) function
public:
    static Registration_obj afd_R;
    int afd_uid;            @ index into evalFcns
    Token name;             @ the name by which I am ref'd in the library
    ADTabsFcn afd_impl_of;  @ the abstract function I implement
    ADTRepr repr;           @ the representation I'm a member of
    Signature_obj afd_sig;  @ my parameters;
    @ only types, not names, are important in my signature; should
    @ probably be a derived class of Signature_obj
    @ functions
    ADTimpFcn_obj(Token, ADTabsFcn, ADTRepr);
    ostream& print(ostream&);
    void printName(ostream&);
    void typedName(ostream& o) { o << repr->name <<  << name; }
    boolean operator==(ADTimpFcn_obj r)
        { return (name == r.name); }
}SSALC(afd_sig())

CLASS(ADTcallSite_obj)
{   @ A call site where the user's program has invoked one of the interface
    @ functions for an ADT.
    boolean implemented;
    ADTimpFcn implementation;
    ADTimpFcn BetterImpl;
public:
    static Registration_obj acs_R;
    int acs_ruid;           @ for registration
    int acs_upid;           @ each call site has a unique id which is its
                            @ base index in the profile eval-function arrays.
    Profarray acs_parr;     @ the profile array for this call site
    ADTabsFcn acs_afcn;     @ the abstract function being invoked
    Signature_obj acs_sig;  @ my actual parameters; names given, types computed
    int acs_line;           @ the line number of the file I'm in
    double acs_rank;
    @ constructors
    ADTcallSite_obj(int, Profarray, int, ADTabsFcn);
    @ functions
    double eval(ADTimpFcn);
    double eval() { assert(implemented); return eval(implementation); }
    ostream& print(ostream&);
    ostream& printForm(ostream&);
    void printName(ostream& font) { font << acs_line; }
    void implement(ADTimpFcn f) { implemented = true; implementation = f; }
    void unimplement() { implemented = false; implementation = nil; }
    void betterImpl() { assert(implemented && implementation!=nil);
                        BetterImpl=implementation; }
}SSALC(acs_sig())

CLASS(UserFcnDecl_obj)
{
    @ We have to know the structure of some user functions' signatures;
    @ user fcns can cause either conversion functions to be invoked
    @ or force certain bindings (e.g. if this global is assigned this
    @ representation, then this formal parm must also have it).  Depends
    @ on whether the analyzer has been implemented with conversion-on-calls
    @ implemented.
public:
    int ufd_upid;           @ each user function has a unique id
    Token name;             @ the name of the user function
    Signature_obj ufd_sig;  @ my parameters; abstract given, repr computed.
    @ fcns
    UserFcnDecl_obj(int, Token);
    ostream& print(ostream&);
}SSALC(ufd_sig())


CLASS(UserFcnCall_obj)
{
public:
    int ufc_upid;           @ each call site of a user function has a unique id
    UserFcnDecl ufc_decl;   @ the function being called.
    Signature_obj ufc_sig;  @ my actual parameters
    @ fcns
    UserFcnCall_obj(int, UserFcnDecl);
    ostream& print(ostream&);
}SSALC(ufc_sig())


CLASS(AliasSet_obj)
{
    friend class VarDecl_obj;
public:
    DECLARE(as_set, Set, VarDecl, vd_bminfo);
    @ fcns
    AliasSet_obj(VarDecl);
    void merge(AliasSet);
}SSALC()

@@ FILE: main.t
#include <stream.h>
#include <stdio.h>
#include <assert.h>
#include "util.H"
#include <math.h>

#ifdef DBG_MALLOC
extern void malloc_verify();
#define MALLOCK do { if (DebugMalloc) malloc_verify(); } while (0)
#else
#define MALLOCK
#endif


@@
@@ read the input file that has all the information we need:
@@ (we might note here that EVERY name in this file MUST be unique)
@@
@@ (1) a list of all ADTs available for analysis, and names of implementations:
@@ (1.1) a list of the interface functions and abstract parameter types;
@@ (1.2) a list of the implementations of the interface functions and
@@ the parameters' implementation types.
@@ (The Therblig system puts these in file <adt>.th in the <adt> directory
@@ impls/<adt>.  The user has a conventional way of creating the
@@ appropriate declaration file, named, ADTs.th, for his program.)
@@ Syntax:
@@ The ADT and its implementations on one line, followed by several
@@ lines of declarations of the abstract functions, followed by the
@@ lines describing the implementations of the abstract functions.  I.e:
@@
@@ <ADT> <ADT_i> ... ;
@@ = <AbsFcnName> <signature> ,
@@ :
@@ <AbsFcnName> <ImpFcnName> <signature> ,
@@ :
@@ ;
@@ : another block like the above
@@ ;
@@
@@ I'll use the equal sign as a flag that this is an abstract function
@@ declaration, and so I won't have to worry about the order of the
@@ declarations if I decide to change it later.  The abstract defns
@@ will come from the ADT_P.H file, and the implementation defn's will
@@ come from the ADT_i.H files.  They end up in their respective ADT_i.th
@@ files, which are included into one file ADTs.th in the user's
@@ directory by the user's makefile.
@@ E.g.:
@@
@@ Set Set_1 Set_2 ... ,
@@ = union1 Set Set ,
@@ = union2 Set Set Set ,
@@ = add Set ? ,
@@ :
@@ union1 union1_1_1 Set_1 Set_1 ,
@@ union1 union1_1_2 Set_1 Set_2 ,
@@ :
@@ union2 union2_1 Set_1 Set_1 Set_1 ,
@@ union2 union2_2 Set_2 Set_2 Set_2 ,
@@ :
@@ ;
@@ List List_1 List_2 ... ,
@@ :
@@ ;
@@ .
@@
@@
@@ (2) all variable declarations (Therblig puts them in ADT_vars.th).
@@ Syntax: var-name ADT-name p1 p2 p3 ... ;
@@ A Set 10 int ... ;
@@ :
@@ .
@@
@@ (3) all user function declarations of interest
@@ (Therblig puts them in ADT_ufcns.th)
@@ Syntax: fid user-fcn-name p1-name p1-type p2-name p2-type ... ;
@@ 145 userFcn A Set BTC List ... ;
@@ :
@@ .
@@
@@ (4) all ADT function call sites (Therblig puts them in ADT_csites)
@@ Syntax: upid ADT-fcn-name var1 var2 ... ;
@@ 1234 union1 A B ;
@@ :
@@ .
@@
@@ (5) all user function call sites (Therblig puts them in ADT_ucsites)
@@ Syntax: upid user-fcn-name var1 var2 ... ;
@@ 134 userFcn A ? C ... ;
@@ :
@@ .
@@
@@
@@ (1) must be written by the compiler when doing the implementations.
@@
@@ (2)-(5) must be written by the compiler when doing the user's program.
@@


#include "userTypes.H"
#include "main_ADTs.H"
Include(types.t)
@@ also defines ADTinfoTable
#include "OptArrays.H"

#include "EvalFcns.H"

#define openFile(fn, io, mode, filename, die) \
    name2(io,stream) fn(filename,mode); \
    if (die && fn.fail()) fatal("Could not open file");

#define fopenFile(fn, mode, filename, die) \
    FILE *fn = fopen(filename, mode); \
    if (die && fn == 0) fatal("Could not fopen file");

@@COERCN_CLASSES


@@ ===== GLOBALS =====
@@Debug(Std)
@@DebugStack
@@DebugPools
DECLARE(ADTs, Map, Token, ADType);               // abstract data types
@@Debug(Off)
DECLARE(ADTReprs, Map, Token, ADTRepr);          // their representations
DECLARE(ADTafcns, Map, Token, ADTabsFcn);        // abstract functions
@@ DECLARE(ADTifcns, Map, Token, ADTimpFcn);     // their implementations
DECLARE(ADTcalls, Set, ADTcallSite, acs_bminfo); // their call sites
DECLARE(Vars, Map, Token, VarDecl);              // variables in the program
DECLARE(UserFcns, Map, Token, UserFcnDecl);      // user declared functions
@@ DECLARE(UserCalls, List, UserFcnCall);        // user call sites
DECLARE(ProfArrays, Map, Token, Profarray);      // the profile data

bool profileDataValid;
double curAssignCost;
int cutoffIndex;
int cutoffPercent;

ADType dontCareADT;
ADTRepr dontCareRepr;


// The following defines are due to therblig's minimal parsing ability
#define VarDecl_obj_print (void *)VarDecl_obj::print
#define VarDecl_obj_printForm (void *)VarDecl_obj::printForm
#define Token_obj_print (void *)Token_obj::print
#define ADTRepr_obj_printName (void *)ADTRepr_obj::printName
#define ADTabsFcn_obj_print (void *)ADTabsFcn_obj::print
#define ADTabsFcn_obj_printName (void *)ADTabsFcn_obj::printName
#define ADTimpFcn_obj_print (void *)ADTimpFcn_obj::print
#define ADTimpFcn_obj_printName (void *)ADTimpFcn_obj::printName
#define c_print (void *)c->print
#define ADTcallSite_obj_print (void *)ADTcallSite_obj::print
#define ADTcallSite_obj_printName (void *)ADTcallSite_obj::printName



@@ THE CLASS ROUTINES FOR THERBLIG

@@ ===== print routines ====

#define DefPrinter(type) \
    ostream& operator<<(ostream& fout, type v) { return v->print(fout); }
#define DefPrintForm(type) \
    ostream& operator<<(ostream& fout, type v) { return v->printForm(fout); }
DefPrinter(Optional)
DefPrinter(VarDecl)
ostream& operator<<(ostream& fout, Signature_obj &v)
{ return v.printForm(fout); }
DefPrinter(ADType)
DefPrinter(ADTRepr)
DefPrinter(ADTabsFcn)
DefPrinter(ADTimpFcn)
DefPrintForm(ADTcallSite)
DefPrinter(UserFcnDecl)
DefPrinter(UserFcnCall)

@@ ================ Registration ================

Registration_obj::next(Any obj, char* str)
{
    if (last == size-1) {
        int newsize = (9*size)/8;
        Any* newreg = new Any[newsize];
        int i;
        for (i = 0; i < size; i++) newreg[i] = reg[i];
        delete [size]reg;
        size = newsize;
        reg = newreg;
        cerr << "Warning: registration for " << str <<
            " increased to " << size << "\n";
    }
    reg[++last] = obj;
    return last;
}
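The registration routine above hands out consecutive identifiers and grows its table by a factor of 9/8 when it fills up. In integer arithmetic `(9*size)/8` equals `size` whenever `size` is below 8, so a small table would not actually grow. The following is a minimal sketch of the same idea in standard C++ with that case guarded; the class name `Registry`, its members, and the choice to start identifiers at 0 are assumptions made for illustration and are not part of the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the registration table: stores objects and
// hands out consecutive ids (0, 1, 2, ...) in registration order.
template <typename T>
class Registry {
public:
    explicit Registry(std::size_t initialCapacity) {
        slots_.reserve(initialCapacity);
    }

    // Stores obj and returns its id.
    int next(T obj) {
        if (slots_.size() == slots_.capacity()) {
            // Grow by about one eighth, but always by at least one slot,
            // so that a very small capacity still increases.
            std::size_t cap = slots_.capacity();
            std::size_t grown = (9 * cap) / 8;
            if (grown <= cap) grown = cap + 1;
            slots_.reserve(grown);
        }
        slots_.push_back(obj);
        return static_cast<int>(slots_.size()) - 1;
    }

    // Looks an object up by the id returned from next().
    T operator[](int id) const {
        return slots_.at(static_cast<std::size_t>(id));
    }

    std::size_t count() const { return slots_.size(); }

private:
    std::vector<T> slots_;
};
```

With an initial capacity of 2, the third call to `next` takes the guarded branch, since `(9*2)/8` is still 2.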



@@ ===== Profarray_obj =====

CONSTRUCTOR(Profarray_obj, FILE *pfile, Token fn, int sz)
{
@@ New requirement: the last entry of each profile array is the number
@@ of times that profile array was written to. When we read the array in
@@ it is converted from long to double, with each entry divided by the
@@ execution count.
    Valid = false;
    filename = fn;
    size = sz;
    iarray = new long[sz+1];
    array = new double[sz+1];
    if (fread(iarray, sizeof(long), size+1, pfile) != size+1) {
        cerr << "Profile data array " << fn << " corruption? Not valid.\n";
        cerr << " size+1=" << size+1 << "?\n";
        fclose(pfile);
        return;
    }
    double nofExecutions = (double)iarray[sz];
    assert(nofExecutions > 0);
    for (int i = 0; i < sz; i++) {
        array[i] = iarray[i] / nofExecutions;
    }
    array[sz] = iarray[sz];
    delete [sz+1]iarray;
    Valid = true;
}
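The normalization step in the constructor above (divide every counter by the trailing execution count, and keep the count itself in the last slot) can be shown in isolation. This is a hedged sketch in standard C++; the function name `normalizeProfile` and the decision to return an empty vector on invalid input (rather than asserting, as the listing does) are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// raw holds n counters followed by one trailing execution count.
// Returns n+1 doubles: the first n are per-execution averages and the
// last is the execution count itself. Returns an empty vector when the
// input is empty or the execution count is not positive.
std::vector<double> normalizeProfile(const std::vector<long>& raw) {
    if (raw.empty() || raw.back() <= 0) return {};
    const double runs = static_cast<double>(raw.back());
    std::vector<double> out(raw.size());
    for (std::size_t i = 0; i + 1 < raw.size(); ++i) {
        out[i] = raw[i] / runs;   // average count per execution
    }
    out.back() = runs;            // the count itself is not divided
    return out;
}
```

For example, counters 10 and 4 accumulated over 2 executions normalize to 5 and 2.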


@@ ===== Optional_obj =====

CONSTRUCTOR(Optional_obj, Token val, OptType type, int index, optionType *tbl)
{
    sval = val;
    idx = index;
    table = tbl;
    if (type == int_opt_type) {
        assert(val != nil);
        if (val->type == num_tkn) {
            ivalid = true;
            ival = val->val;
        }
        else {
            cerr << "Warning: " << val << " is not an integer\n";
            ivalid = false;
        }
    }
    else if (type == tmpint_opt_type) {
        cerr << "Warning: " << val << " is not a valid optional\n";
    }
}

CONSTRUCTOR(Optional_obj) // used by feasibility/eval routines
{
    ivalid = false;
    sval = nil;
    idx = 0x4FFFFFFF; // something to cause a problem if used
    table = nil;
}

CONSTRUCTOR(Optional_obj, int value) // used by feasibility/eval routines
{
    ivalid = true;
    ival = value;
    sval = nil;
    idx = 0x4FFFFFFF; // something to cause a problem if used
    table = nil;
}

ostream &
Optional_obj::print(ostream& fout)
{
    return fout << table[idx].name;
}

ostream &
Optional_obj::printForm(ostream& fout)
{
    return fout << table[idx].name;
}


@@ ===== VarDecl_obj =====

Registration_obj VarDecl_obj::vd_R(maxNofVarDecls);
int NofVarDecl(VarDecl vd)
{
    return (((VarDecl)vd)->rnum);
}
VarDecl VarDeclNo(int i)
{
    return (VarDecl_obj::vd_R[i]);
}

void
VarDecl_obj::init_VarDecl()
{
    for (int i = 0; i < maxNofOptsPerADT; i++) {
        vd_opts[i] = nil;
    }
    vd_bestRepr = nil;
    rnum = vd_R.next(this,"VarDecl");
    vd_as = new AliasSet_obj(this);
}

CONSTRUCTOR(VarDecl_obj, Token t, ADType a, ADTRepr r)
{
    name = t;
    vd_ADT = a;
    vd_repr = r; implemented = true;
    init_VarDecl();
}

CONSTRUCTOR(VarDecl_obj, Token t, ADType a)
{
    name = t;
    vd_ADT = a;
    vd_repr = nil; implemented = false;
    init_VarDecl();
}

CONSTRUCTOR(VarDecl_obj, Token t, ADType a, Token tr)
{
    ADTRepr r = new ADTRepr_obj(tr,a);
    if (tr != dontCareToken && !Map_in(a->adt_reprs, r->name)) {
        fatal("VarDecl passed Token not in ADType's repr List");
    }
    name = t;
    vd_ADT = a;
    vd_repr = r; implemented = true;
    init_VarDecl();
}

CONSTRUCTOR(VarDecl_obj, Token t, Token ta)
{
    ADType a;
    name = t;
    Map_value(ADTs, ta, a); // ADTs->value(ta,a);
    if (a == nil) {
        a = new ADType_obj(ta);
        Map_define(ADTs, ta, a);
    }
    vd_ADT = a;
    vd_repr = nil; implemented = false;
    init_VarDecl();
}

boolean
VarDecl_obj::operator==(VarDecl_obj &that)
{
    if (name != that.name) return false;
    if (vd_ADT != that.vd_ADT) return false;
    if (!List_equal(vd_adtParms, that.vd_adtParms)) return false;
    return true;
}

ostream&
VarDecl_obj::print(ostream& fout)
{
    fout << name << "(" << vd_ADT->name;
    if (implemented) {
        if (vd_repr != nil) fout << "/" << vd_repr->name;
        else fout << "/<nil>";
    }
    if (vd_bestRepr != nil) fout << "/" << vd_bestRepr->name;
    return fout << ")";
}

ostream&
VarDecl_obj::printForm(ostream& fout)
{
    if (implemented) {
        if (vd_repr != nil) fout << vd_repr->name;
        else fout << "<nil>";
    }
    if (vd_bestRepr != nil) fout << "/" << vd_bestRepr->name;
    else fout << "(" << vd_ADT->name << ")";
    return fout << " " << name;
}

void
VarDecl_obj::aliasOf(VarDecl v)
{
    AliasSet AS = v->vd_as;
    VarDecl var;
    if (AS == vd_as) {
        // then this is a redundant call: they are both pointing to the same
        // alias set.
        return;
    }
    forAll(var, v->vd_as->as_set,
        if (var != v) var->vd_as = vd_as; // because iterCheck gets upset
    );
    v->vd_as = vd_as;
    vd_as->merge(AS); // also frees up set AS
}
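The alias bookkeeping in `aliasOf` above amounts to merging two sets and redirecting every member of the absorbed set to the surviving one. The sketch below restates that idea in standard C++ under stated assumptions: `Var`, `AliasSet`, and the free function `aliasOf` are hypothetical names, shared ownership replaces the explicit `delete` in the listing, and `Var` objects are assumed never to be copied (each set stores raw addresses).

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>
#include <utility>

struct Var;
using AliasSet = std::set<Var*>;

// Hypothetical variable record: each variable starts in its own alias set.
struct Var {
    std::string name;
    std::shared_ptr<AliasSet> aliases;
    explicit Var(std::string n)
        : name(std::move(n)), aliases(std::make_shared<AliasSet>()) {
        aliases->insert(this);
    }
    Var(const Var&) = delete;             // the set holds this object's address
    Var& operator=(const Var&) = delete;
};

// Merges other's alias set into self's and points every member of the
// absorbed set at the surviving set. A no-op if they already share a set.
void aliasOf(Var& self, Var& other) {
    if (self.aliases == other.aliases) return;
    std::shared_ptr<AliasSet> old = other.aliases; // keep alive while merging
    for (Var* v : *old) {
        self.aliases->insert(v);
        v->aliases = self.aliases;
    }
}
```

After `aliasOf(b, c)` followed by `aliasOf(a, b)`, all three variables share one set of size three, and a further `aliasOf(a, c)` changes nothing.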


@@ ===== Signature_obj =====

CONSTRUCTOR(Signature_obj)
{
    // do nothing special
}

int
Signature_obj::len()
{
    return List_length(sig_sig);
}


void
Signature_obj::add(VarDecl v)
{
    List_appendl(sig_sig, v);
}

void
Signature_obj::add_dontCare()
{
    VarDecl v = new VarDecl_obj(dontCareToken, dontCareADT, dontCareToken);
    List_appendl(sig_sig, v);
}

void
Signature_obj::add(Token vn, Token an, Token rn)
{
    ADType a;
    Map_value(ADTs, an, a);
    if (a == nil) {
        fatal("Signature passed non_adt Token");
    }
    VarDecl v = new VarDecl_obj(vn, a, rn);
    List_appendl(sig_sig, v);
}

void
Signature_obj::add(VarDecl v, Token an, Token rn)
{
    ADType a;
    Map_value(ADTs, an, a);
    if (a == nil) {
        fatal("Signature passed non_adt Token");
    }
    List_appendl(sig_sig, v);
}

void
Signature_obj::add(Token vn, ADType a, Token rn)
{
    VarDecl v = new VarDecl_obj(vn, a, rn);
    List_appendl(sig_sig, v);
}

void
Signature_obj::add(Token vn, ADType a, ADTRepr r)
{
    VarDecl v = new VarDecl_obj(vn, a, r);
    List_appendl(sig_sig, v);
}

ostream&
Signature_obj::print(ostream& fout)
{
    fout << "{Sig:";
    List_print(sig_sig, fout, VarDecl_obj_printForm);
    return fout << "}";
}

ostream&
Signature_obj::printForm(ostream& fout)
{
    fout <<
    List_print(sig_sig, fout, VarDecl_obj_printForm);
    return fout <<
}



@@ ===== ADType_obj =====

int
ADTinfoTableLookup(Token t)
{
    for (int i=0; i < nofADTs; i++) {
        if (*t == ADTinfoTable[i].name) return i;
    }
    cerr << t << " : ";
    fatal("Unknown ADT in ADTinfoTableLookup");
}

CONSTRUCTOR(ADType_obj, Token t)
{
    name = t;
    adt_inited = false;
    adt_number = ADTinfoTableLookup(name);
}

CONSTRUCTOR(ADType_obj, char *sp)
{
    name = new Token_obj(sp, id_tkn, 0);
    adt_inited = false;
    adt_number = ADTinfoTableLookup(name);
}

ADType
isADType(Token t)
{
    ADType adt;
    Map_value(ADTs, t, adt);
    if (adt == nil) {
        adt = new ADType_obj(t);
        Map_define(ADTs, t, adt);
    }
    return adt;
}


ostream&
ADType_obj::dump(ostream& fout)
{
    fout << "{ADType: " << BOOL(adt_inited)
         << " " << name << "\nreprs = ";
    Map_print(adt_reprs, fout,
              Token_obj_print,
              ADTRepr_obj_printName);
    fout << "\nabs. fcns. = ";
    Set_print(adt_afcns, fout, ADTabsFcn_obj_printName);
    return fout << "}";
}

ostream&
ADType_obj::print(ostream& fout)
{
    return fout << name;
    return fout <<
}


@@ ===== ADTRepr_obj =====

CONSTRUCTOR(ADTRepr_obj, Token t, ADType a)
{
    name = t;
    adtr_of = a;
    adtr_inited = false;
    adtr_suffix = t->suffix();
}

CONSTRUCTOR(ADTRepr_obj, Token t)
{
    name = t;
    adtr_of = dontCareADT;
    adtr_inited = false;
    adtr_suffix = t->suffix();
}

// overload isADTRepr;

ADTRepr
isADTRepr(Token t)
{
    ADTRepr adtr;
    Map_value(ADTReprs, t, adtr);
    if (adtr == nil) {
        adtr = new ADTRepr_obj(t);
        Map_define(ADTReprs, t, adtr);
    }
    return adtr;
}

ADTRepr
isADTRepr(Token t, ADType a)
{
    ADTRepr adtr;
    Map_value(ADTReprs, t, adtr);
    if (adtr == nil) {
        adtr = new ADTRepr_obj(t,a);
        Map_define(ADTReprs, t, adtr);
    }
    else {
        if (*adtr->adtr_of != *a) {
            fatal("error in isADTRepr(t,a)");
        }
    }
    return adtr;
}


void
ADTRepr_obj::printName(ostream& fout)
{
    fout << name;
}

ostream&
ADTRepr_obj::print(ostream& fout)
{
    fout << "{ADTRepr_" << adtr_suffix << " : "
         << BOOL(adtr_inited) << " " << name << "(" << adtr_of << ")";
    return fout << "}";
}


@@ ===== ADTabsFcn_obj =====

Registration_obj ADTabsFcn_obj::adtaf_R(maxNofADTabsFcn);
int NofADTabsFcn(ADTabsFcn af)
{
    return (((ADTabsFcn)af)->adtaf_uid);
}
ADTabsFcn ADTabsFcnNo(int i)
{
    return (ADTabsFcn_obj::adtaf_R[i]);
}

CONSTRUCTOR(ADTabsFcn_obj, Token t, ADType a)
{
    name = t;
    adtaf_for = a;
    adtaf_uid = adtaf_R.next(this,"ADTabsFcn");
}

ostream&
ADTabsFcn_obj::print(ostream& fout)
{
    fout << "{ADTabsFcn(" << name << "(" << adtaf_for << ")" << adtaf_sig
         << ")\nimpls:";
    Set_print(adtaf_impl_fcns, fout, ADTimpFcn_obj_printName);
    return fout << "}";
}

void
ADTabsFcn_obj::printName(ostream& fout)
{
    fout << name;
}


@@ ===== ADTimpFcn_obj =====

Registration_obj ADTimpFcn_obj::afd_R(maxNofADTimpFcn);
int NofADTimpFcn(ADTimpFcn aif)
{
    return (((ADTimpFcn)aif)->afd_uid);
}
ADTimpFcn ADTimpFcnNo(int i)
{
    return (ADTimpFcn_obj::afd_R[i]);
}

CONSTRUCTOR(ADTimpFcn_obj, Token t, ADTabsFcn af, ADTRepr r)
{
    name = t;
    afd_impl_of = af;
    repr = r;
    afd_uid = afd_R.next(this,"ADTimpFcn");
}

void
ADTimpFcn_obj::printName(ostream& fout)
{
    fout << name;
}

ostream&
ADTimpFcn_obj::print(ostream& fout)
{
    fout << "{ADTimpFcn:" << name << "(" << afd_impl_of << ")"
         << afd_sig;
    return fout << "}";
}


@@ ===== ADTcallSite_obj =====

@@ a presumably invalid value
#define IllegalRank -999999.0

Registration_obj ADTcallSite_obj::acs_R(maxNofADTcallSite);
int NofADTcallSite(ADTcallSite acs)
{
    return (((ADTcallSite)acs)->acs_ruid);
}
ADTcallSite ADTcallSiteNo(int i)
{
    return (ADTcallSite_obj::acs_R[i]);
}

CONSTRUCTOR(ADTcallSite_obj, int lno, Profarray array, int upid, ADTabsFcn af)
{
    acs_upid = upid;
    acs_afcn = af;
    acs_line = lno;
    acs_parr = array;
    implemented = false;
    acs_ruid = acs_R.next(this,"ADTcallSite");
    acs_rank = IllegalRank;
}

double
ADTcallSite_obj::eval(ADTimpFcn f)
{
    return (*evalFcns[f->afd_uid])(this);
}

ostream&
ADTcallSite_obj::print(ostream& fout)
{
    fout << "{ADTcallSite(" << acs_upid << ")" << acs_afcn
         << acs_sig;
    return fout << "}";
}

ostream&
ADTcallSite_obj::printForm(ostream& fout)
{
    fout << acs_afcn->name << " (line " << acs_line << " file "
         << acs_parr->file() << " p[" << acs_upid << "]=(";
    for (int i = 0; i < acs_afcn->nofProfVars; i++) {
        if (i > 0) fout <<
        fout << (*acs_parr)[acs_upid + i];
    }
    fout << ") ";
    if (implemented) {
        fout<<implementation->repr->name<<" cost "<<eval(implementation);
    }
    else if (BetterImpl != nil) {
        fout << BetterImpl->repr->name << " cost " << eval(BetterImpl);
    }
    else {
        fout << "profiling costs " << eval(acs_afcn->evalFcn);
    }
    fout << ")";
    return fout;
}
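`ADTcallSite_obj::eval` above looks up a cost function in a table indexed by the implementation function's id and applies it to the call site. The following stand-alone sketch shows that table-dispatch pattern in standard C++. Everything in it is hypothetical: the `CallSite` fields, the two cost functions, and the table contents are invented for illustration and are not taken from `EvalFcns.H`, which this listing does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical profile summary for one call site.
struct CallSite {
    double calls;    // average number of calls per execution
    double avgSize;  // average size of the container at the call
};

using EvalFcn = double (*)(const CallSite&);

// Two made-up cost models, one per candidate implementation.
double costLinear(const CallSite& cs)   { return cs.calls * cs.avgSize; }
double costConstant(const CallSite& cs) { return cs.calls; }

// Table indexed by implementation id, playing the role of evalFcns[].
const std::vector<EvalFcn> evalTable = { costLinear, costConstant };

// Applies the cost function registered under implId to the call site.
double evalCost(const CallSite& cs, int implId) {
    return evalTable.at(static_cast<std::size_t>(implId))(cs);
}
```

Keeping the cost functions in a table keyed by a registration id means a new implementation only needs a new table entry, which matches how the listing pairs `afd_uid` with `evalFcns`.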


@@ ===== UserFcnDecl_obj =====

CONSTRUCTOR(UserFcnDecl_obj, int i, Token t)
{
    ufd_upid = i;
    name = t;
}

ostream&
UserFcnDecl_obj::print(ostream& fout)
{
    fout << "{UserFcnDecl(" << ufd_upid << ") " << name << ufd_sig;
    return fout << "}";
}

@@ ===== UserFcnCall_obj =====

CONSTRUCTOR(UserFcnCall_obj, int i, UserFcnDecl uf)
{
    ufc_upid = i;
    ufc_decl = uf;
}

ostream&
UserFcnCall_obj::print(ostream& fout)
{
    fout << "{UserFcnCall(" << ufc_upid << ") " << ufc_decl << ufc_sig;
    return fout << "}";
}


@@ ===== AliasSet_obj =====

CONSTRUCTOR(AliasSet_obj, VarDecl v)
{
    Set_add(as_set, v);
}

void
AliasSet_obj::merge(AliasSet a)
{
    Set_unionl(as_set,a->as_set);
    delete a;
}


@@ ===== here is the beginning of the input routines for therblig =====

void
readADTs(char *finame)
{
    Token t;
    int adtrnumber = 0;
    istream fin(finame,"r");
    if (fin.fail()) fatal("Could not open file in readADTs");
    @@ read the names of the abstract data types (ADTs)
    fin >> t;
    while (t != dotToken) {
        @@ - <adt> <adt_i> ... ,
        assert(t == minusToken);
        fin >> t;
        ADType adt = isADType(t);
        if (adt->adt_inited) {
            fatal("Duplicate ADT types declared");
        }
        @@ read the names of the implementations of this adt
        @@ the profiling implementation is the ADT name appended with '_P'
        @@ and is created automatically.
        @@ It is not added to the dictionary for this type, since it never
        @@ enters into the assignment computation. But it must exist for
        @@ printAssignments procedure to work generally.
        adt->adt_profileImpl = isADTRepr(t->append("_P"));
        adt->adt_profileImpl->adtr_inited = true;
        adt->adt_profileImpl->adtr_number = adtrnumber++;
        fin >> t;
        while (t!=commaToken) {
            ADTRepr adti = isADTRepr(t);
            if (adti->adtr_inited) {
                fatal("Duplicate ADT Reprs declared");
            }
            adti->adtr_number = adtrnumber++;
            Map_define(adt->adt_reprs, t, adti);
            adti->adtr_inited = true;
            fin >> t;
        }
        assert(t==commaToken);
        @@ read the names of the (abstract) functions in the interface
        @@ now we have one of two kinds of lines:
        @@   = <absfcnname> <Signature> ,
        @@ or
        @@   <absfcnname> <impfcnname> <Signature> ,
        @@ down to the first semi-colon.
        fin >> t;
        while (t!=semiToken) {
            if (t==eqToken) {
                int nofProfVars;
                @@ = <absfcnname> <nofProfVars> <parmtype1> <parmtype2> ... ,
                fin >> t;
                ADTabsFcn adaf = new ADTabsFcn_obj(t, adt);
                Set_add(adt->adt_afcns, adaf);
                if (Map_in(ADTafcns, t))
                    fatal("duplicate abstract function names");
                Map_define(ADTafcns, t, adaf);
                @@ read the number of profiling variables for this function
                fin >> adaf->nofProfVars;
                @@ now read the parameters to the abstract function
                fin >> t;
                while (t!=commaToken) {
                    VarDecl v = new VarDecl_obj(dontCareToken, t);
                    adaf->adtaf_sig.add(v);
                    fin >> t;
                }
                @@ the profiling evaluation functions must be accessible, but
                @@ they are declared implicitly.
                adaf->evalFcn=new ADTimpFcn_obj(t,adaf,adt->adt_profileImpl);
            }
            else if (t == plusToken) {
                @@ + optionalName optionalName ... ,
                do {
                    fin >> t;
                } while (t != commaToken);
            }
            else {
                @@ <ADT> <reprname> <absfcnname> <impfcnname> <type> <type>
                @@ (1)   (2)        (3)          (4)
                @@ Each function is associated with several names:
                @@ (1) the name of the ADT it is an impl'n function for;
                @@ (2) the name of the representation it is an impl'n
                @@     function for;
                @@ (3) the name of the abstract function it implements;
                @@     These must be unique across all ADTs.
                @@ (4) the name of the member function by which it is invoked;
                @@     (this is finessed right now: all member functions
                @@     are invoked by the same name as their abstract
                @@     function)
                ADType adt2;
                Map_value(ADTs, t, adt2);
                if (adt2 == nil)
                    fatal("Unknown ADT in imp fcn dcln");
                @@ read the repr name (Set_1, Map_2, etc.)
                Token rn;
                fin >> rn;
                ADTRepr repr = isADTRepr(rn);
                @@ read the abstract fcn name;
                ADTabsFcn adaf2;
                Token aftok;
                fin >> aftok;
                Map_value(ADTafcns, aftok, adaf2);
                if (adaf2 == nil)
                    fatal("Unknown abstract function name in imp fcn dcln");
                @@ read an implementation name of an abstract function
                fin >> t;
                ADTimpFcn fd = new ADTimpFcn_obj(t, adaf2, repr);
                Set_add(adaf2->adtaf_impl_fcns, fd);
                @@ read the parameters of the implementation function
                fin >> t;
                while (t != commaToken) {
                    if (t == questToken) {
                        fd->afd_sig.add_dontCare();
                    }
                    else {
                        ADTRepr adtr = isADTRepr(t);
                        assert(t==dontCareToken || adtr->name!=dontCareToken);
                        fd->afd_sig.add(dontCareToken, adtr->adtr_of, adtr);
                    }
                    fin >> t;
                }
            }
            @@ this declaration line processed
            fin >> t;
        }
        @@ an adt group processed;
        adt->adt_inited = true;
        fin >> t;
    }
    @@ all ADTs are read
    @@ need a check that all ADTs have been inited
    @@ and that all ADTReprs have been inited;
    ADType adtelt; Token name;
    forAll( 'name , adtelt \ ADTs,
        if (!adtelt->adt_inited) {
            error(name->str);
            fatal("An uninitialized ADT");
        }
    );
    ADTRepr reprelt;
    forAll( 'name , reprelt \ ADTReprs,
        if (!reprelt->adtr_inited) {
            error(name->str);
            fatal("An uninitialized ADT representation");
        }
    );
}


void
readVarDecls(Token ftok, char *finame)
{
    Token var;
    openFile(fin,i,"r",finame,true);
    fin >> var;
    while (var != dotToken) {
        Token typ,parm;
        int cnt = 0;
        fin >> typ;
        @@ assert(isanadt(typ));
        @@ read all the variable names declared in this program
        ADType adt3;
        Map_value(ADTs, typ, adt3);
        VarDecl vd = new VarDecl_obj(var, adt3);
        fin >> parm;
        while (parm != commaToken &&
               cnt < ADTinfoTable[adt3->adt_number].nofReqd) {
            List_appendl(vd->vd_adtParms, parm);
            fin >> parm;
            cnt++;
        }
        assert(cnt == ADTinfoTable[adt3->adt_number].nofReqd);
        @@ just to be on the safe side, we'll check that if the var name
        @@ is already defined, its declaration matches exactly what we
        @@ have already defined,
        if (Map_in(Vars, var)) {
            VarDecl vddefd;
            Map_value(Vars, var, vddefd);
            if (*vd != *vddefd) {
                cerr << vd << "\n" << vddefd << "\n";
                fatal("Duplicate declarations of variable do not match");
            }
            delete vd;
        }
        else {
            Map_define(Vars, var, vd);
        }
        // is also the place to read the optionals
        // parse_optionals(vd, ADTinfoTable[adt3->adt_number].tbl);
        // the optionals are of the form dd or dd=token
        optionType *tbl = ADTinfoTable[adt3->adt_number].tbl;
        while (parm != commaToken) {
            // the parm better be an integer
            int idx = parm->integer();
            fin >> parm;
            if (parm == eqToken) {
                fin >> parm;
                vd->vd_opts[idx] =
                    new Optional_obj(parm, tbl[idx].type, idx, tbl);
                fin >> parm;
            }
            else {
                vd->vd_opts[idx] =
                    new Optional_obj(nil, tbl[idx].type, idx, tbl);
            }
        }
        fin >> var;
    }
}

void


readUserFcnDecls(Token ftok, char *finame)
{
    Token t;
    openFile(fin,i,"r",finame,true);
    fin >> t;
    while (t!=dotToken) {
        int num = t->integer();
        Token fcnnamet, varnamet;
        fin >> fcnnamet;
        UserFcnDecl ufd = new UserFcnDecl_obj(num, fcnnamet);
        assert(!Map_in(UserFcns,fcnnamet));
        Map_define(UserFcns,fcnnamet,ufd);
        @@ read the parm types
        fin >> varnamet;
        while (varnamet!=commaToken) {
            Token vartypet;
            VarDecl var;
            fin >> vartypet;
            assert(vartypet != commaToken);
            if (varnamet == dontCareToken) {
                assert(vartypet == dontCareToken);
                ufd->ufd_sig.add_dontCare();
            }
            else {
                Map_value(Vars, varnamet, var);
                if (var == nil) {
                    cerr << "Undeclared variable read in readUserFcnDecls"
                         << varnamet << "\n";
                    exit(1);
                }
                ufd->ufd_sig.add(var, vartypet, dontCareToken);
            }
            fin >> varnamet;
        }
        fin >> t;
    }
}


void 

readADTcallSites (Token  ftok,  char  *f iname) 

{.  @@  call  sites  of  calls  on  functions  in  the  interface  of  an  adt 
Token  t; 

Profarray  prof ileArray ; 
openFile (f in, i , "r" ,f iname , true) ; 
fin  »  t; 

Map_value (Prof Arrays ,  ftok,  prof ileArray) ; 
while  (t !  =dotToken)  {. 

Iterator  apli; 
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1033 

1034 

1035 

1036 

1037 

1038 

1039 

1040 

1041 

1042 

1043 

1044 

1045 

1046 

1047 

1048 

1049 

1050 

1051 

1052 

1053 

1054 

1055 

1056 

1057 

1058 

1059 

1060 
1061 
1062 

1063 

1064 

1065 

1066 

1067 

1068 

1069 

1070 

1071 

1072 

1073 

1074 

1075 

1076 


        Token nt;
        ADTabsFcn aafl;
        int profnum = t->integer();
        fin >> t;

        int linenum = t->integer();
        fin >> nt;
        Map_value(ADTafcns, nt, aafl);
        assert(aafl != nil);
        ADTcallSite afc =
            new ADTcallSite_obj(linenum, profileArray, profnum, aafl);

        @@ read the actual parms
        fin >> t;
        List_iterInit(aafl->adtaf_sig.sig_sig, apli);
        while (t != commaToken) {
            VarDecl vd3;
            VarDecl abvd3;
            assert(!List_iterDone(aafl->adtaf_sig.sig_sig, apli));
            List_iterate(aafl->adtaf_sig.sig_sig, apli, abvd3);
            Map_value(Vars, t, vd3);
            @@ if the variable is not found, it is a dontCare;
            @@ That is, there is not a dontCareVar;
            @@ this can be confirmed by seeing if the corresponding parameter
            @@ of the abstract function is a dont care. Otherwise, error;
            if (vd3 == nil) {
                if (abvd3 != nil) {
                    if (abvd3->vd_ADT == dontCareADT) {
                        afc->acs_sig.add_dontCare();
                    }
                    else {
                        fatal("Unrecognized var in call site parm list");
                    }
                }
                else {
                    fatal("List_iterate(aafl->adtaf_sig.sig_sig,apli,abvd3)");
                }
            }
            else {
                assert(abvd3->vd_ADT != dontCareADT);
                assert(vd3->name != dontCareToken);
                afc->acs_sig.add(vd3); @@ type to be computed;
                Set_add(vd3->vd_inSigsOf, afc);
            }

            fin >> t;
        }

        List_iterCleanup(aafl->adtaf_sig.sig_sig, apli);
        Set_add(ADTcalls, afc);
        fin >> t;
    }
}


void
readUserFcnCalls(Token ftok, char *finame)
{   @@ call sites of calls on (interesting) user functions
    Token t;
    VarDecl fcnParm;
    openFile(fin, i, "r", finame, true);
    fin >> t;

    while (t != dotToken) {
        int num = t->integer();
        Token nt;
        fin >> nt;

        UserFcnDecl ufd3;
        Map_value(UserFcns, nt, ufd3); @@ must exist;
        assert(ufd3 != nil);
        UserFcnCall ufc = new UserFcnCall_obj(num, ufd3);
        Iterator sigiter;
        Signature_obj& formalParms = ufc->ufc_decl->ufd_sig;
        List_iterInit(formalParms.sig_sig, sigiter);

        @@ read the actual parms
        fin >> t;
        while (t != commaToken) {
            @@ assert((t == dontCare) || isavar(t));
            VarDecl vd4;
            Map_value(Vars, t, vd4);
            ufc->ufc_sig.add(vd4);

            // alias this variable with the formal
            List_iterate(formalParms.sig_sig, sigiter, fcnParm);
            if (vd4 == nil) {
                assert(fcnParm != nil && fcnParm->name == dontCareToken);
            }
            else {
                vd4->aliasOf(fcnParm);
                if (DebugDetails) {
                    cerr << ">>>Aliases for "
                         << hex((int)vd4) << " " << vd4->name <<
                    Set_print(vd4->vd_as->as_set, cerr, VarDecl_obj_printForm);
                    cerr << "\n";
                    cerr << ">>>Aliases for "
                         << hex((int)fcnParm) << " " << fcnParm->name <<

                    Set_print(fcnParm->vd_as->as_set, cerr,
                              VarDecl_obj_printForm);
                    cerr << "\n";
                }
            }
            fin >> t;
        }
        fin >> t;
        List_iterCleanup(formalParms.sig_sig, sigiter);
    }
}

void
readProfileData(Token ftok, char *finame)
{
    if (!profileDataValid) return;
    @@ Form of profile data input:
    @@ It is an array of 32-bit integers. Each slot in the array
    @@ corresponds to a profile variable declared in the ADT profiling
    @@ implementations, or to the static call site of an adt function or
    @@ user function. The number read with each function in
    @@ readprogramdesc is the beginning location of the profile
    @@ variables for all functions. E.g. if f1 collects 5 profile
    @@ variables, and f1's number is 15, then locations 15 through 19
    @@ are the locations in the profile array of f1's profile variables.
    @@ these variables are not accessed by anything here except that the
    @@ first location of each function read is its execution frequency count.
    @@ (this is a required convention and is not currently enforced by
    @@ any software checks. this is easy to fix by always declaring
    @@ p_cnt for ADT interface functions.)
    @@ Note that we are talking about the unique profile id (upid), not the
    @@ function id (fid).
    @@ If the profile data file does not exist, then our choices are simple:
    @@ all implementations are profiling (*_P) implementations.
    @@ New requirement: the last entry of each profile array is the number
    @@ of times that profile array was written to. When we read the array in
    @@ it is converted from long to double, with each entry divided by the
    @@ execution count.
    long size;
    FILE *pfile = fopen(finame, "r");
    if (pfile == 0) {
        profileDataValid = false;
        if (DebugFiles) {
            cerr << " Profile file " << finame << " not there.\n";
        }
        return;
    }
    fread((char *)&size, sizeof(long), 1, pfile);
    Profarray pa = new Profarray_obj(pfile, ftok, size);
    if (!pa->valid()) {
        profileDataValid = false;
        return;
    }
    Map_define(ProfArrays, ftok, pa);
    fclose(pfile);
    return;
}
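
The comment in readProfileData describes the on-disk convention: raw integer counters, with the last slot holding the number of times the array was written, and each entry divided by that count on input. A minimal sketch of that normalization follows, in standard C++ rather than the listing's dialect. The function name `normalizeProfile` and the choice to drop the trailing count slot from the result are assumptions made for illustration; the listing does not show how `Profarray_obj` does this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the original listing.
// Converts raw profile counters into per-run averages. By the convention
// described above, the last slot of the raw array is the number of times
// the array was written (the execution count).
std::vector<double> normalizeProfile(const std::vector<std::int32_t>& raw)
{
    assert(!raw.empty());
    const std::int32_t runs = raw.back();
    std::vector<double> averaged;
    if (runs <= 0) return averaged;          // nothing usable was recorded
    averaged.reserve(raw.size() - 1);
    // Every slot except the trailing run count is scaled by the run count.
    for (std::size_t i = 0; i + 1 < raw.size(); ++i)
        averaged.push_back(static_cast<double>(raw[i]) / runs);
    return averaged;
}
```

Under this sketch, a function whose base number is 15 and which collects 5 profile variables would read its averaged values from slots 15 through 19 of the returned vector.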


@@ ===============================
@@ end of input section of program
@@ ===============================

@@ main section

@@ global variables
DECLARE(sortedCallSites, List, ADTcallSite);

@@ little functions

boolean
mapsto(ADTRepr r, ADType t)
{
    @@ can t be implemented/represented by r
    if (DebugDetails)
        cerr << ">>>mapsto(ADTRepr " << r << "::ADType " << t << ")\n";
    /* if (r == dontCareRepr) {
           if (t == dontCareADT) return true;
           else return false;
       }
       else if (t == dontCareADT) return false;
       -- the above is done by have a pseudo-ADT called dontCareADT, with
       -- one don't care ADTRepr dontCareRepr
    */
    return Map_in(t->adt_reprs, r->name);
}

boolean
mapsto(VarDecl actual, VarDecl formal)
{
    assert(formal->implemented);

    if (DebugDetails)
        cerr << ">>>mapsto(VarDecl " << actual << "::VarDecl " << formal << ")\n";
    if (actual->implemented) {
        return (*actual->vd_repr == *formal->vd_repr);
    }
    else return mapsto(formal->vd_repr, actual->vd_ADT);
}

boolean
mapsto(Signature_obj &actual, Signature_obj &formal)
{
    @@ do the types of the variables map to the abstract types in the first
    @@ parameter List. they map if each parameter maps.
    if (actual.len() != formal.len()) return false;
    Iterator nexta, nextf;
    VarDecl av, fv;
    if (DebugDetails)
        cerr << ">>>mapsto(Sig " << actual << "::Sig " << formal << ")\n";
    List_iterInit(actual.sig_sig, nexta);
    List_iterInit(formal.sig_sig, nextf);
    while (List_iterate(actual.sig_sig, nexta, av) &&
           List_iterate(formal.sig_sig, nextf, fv))
    {
        boolean r_mapsto, r_feas;
        if (!(r_mapsto = mapsto(av, fv))
            || (r_feas = (fv->vd_repr != dontCareRepr
                          && !(*feasibilityFcns[fv->vd_repr->adtr_number])(av)))) {
            List_iterCleanup(actual.sig_sig, nexta);
            List_iterCleanup(formal.sig_sig, nextf);
            if (DebugDetails) {
                cerr << " >>>Doesn't map because ";
                if (!r_mapsto) {
                    cerr << "the actual does not map to the formal\n";
                }
                else {
                    assert(r_feas);
                    cerr << av->name << " cannot be implemented by "
                         << fv->vd_repr->name << "\n";
                }
            }
            return false;
        }
    }
    List_iterCleanup(actual.sig_sig, nexta);
    List_iterCleanup(formal.sig_sig, nextf);
    return true;
}


// overload isCompatible;

boolean
isCompatible(ADTimpFcn fi, ADTcallSite c)
{
    if (DebugAssign)
        cerr << ">>>isCompatible(" << fi << "," << c << ") == ";
    if (mapsto(c->acs_sig, fi->afd_sig)) {
        if (DebugAssign) cerr << "true\n";
        return true;
    }
    else {
        if (DebugAssign) cerr << "false\n";
        return false;
    }
}


DeclareUserFcn(findCompatibleImplementations, ?, ?, Ic, Set)@;

void
findCompatibleImplementations(ADTcallSite &c,
                              DeclareParm(Ic, Set, ADTimpFcn, afd_bminfo))
{   @@ find all implementations of c->acs_afcn compatible with the
    @@ parameters in call site c;
    ADTimpFcn fi;
    Set_makeEmpty(Ic);
    forAll(fi, c->acs_afcn->adtaf_impl_fcns,
        if (isCompatible(fi, c)) {
            Set_add(Ic, fi);
        }
    );
    return;
}

DeclareUserFcn(callSitesContaining, ?, ?, callSitesp, Set)@;

void
callSitesContaining(VarDecl v, DeclareParm(callSitesp, Set, ADTcallSite,
                                           acs_bminfo))
{
    @@ union is overkill: callSitesp is always empty. An interesting
    @@ data point for therblig analysis, though.
    Set_union1(callSitesp, v->vd_inSigsOf);
}


void
assignImplType(VarDecl v, ADTRepr t)
{
    assert(!v->implemented);
    if (DebugDetails)
        cerr << ">>>assignImplType: " << v << " <- " << t << "\n";
    v->vd_repr = t;
    v->implemented = true;
}


void
unassignImplType(VarDecl v)
{
    assert(v->implemented);
    v->implemented = false;
}

boolean
implementable(VarDecl callSiteVar, ADTRepr r)
{
    @@ The inherent feasibility of assigning callSiteVar the
    @@ representation r was checked in isCompatible.
    assert((*feasibilityFcns[r->adtr_number])(callSiteVar));
    @@
    @@ if callSiteVar is assigned the impl. type impFcnFormalVar->vd_repr,
    @@ then for every call site c that has callSiteVar in its actual parameter
    @@ List, check that there still exists AT LEAST ONE impl'n function
    @@ that can be used to implement the function called at c.
    @@ This check is not absolutely necessary, but I suspect it may cut
    @@ down on the amount of backtracking.
    @@ Side effect: callSiteVar is assigned impln
    @@ impFcnFormalVar->vd_repr, if feasible.
    @@
    DECLARE(callSites, Set, ADTcallSite, acs_bminfo);
    ADTcallSite c;
    @@ by defn of findCompatibleImplementations:
    assert(mapsto(r, callSiteVar->vd_ADT));
    assignImplType(callSiteVar, r);
    if (DebugAssign) {
        cerr << ">>>trying " << r << " for "
             << callSiteVar->name << "\n";
    }
    CallUserFcn(callSitesContaining, callSiteVar, callSites)@;
    callSitesContaining(callSiteVar, callSites);
    if (DebugAssign) {
        cerr << ">>>callSitesContaining: ";
        Set_print(callSites, cerr, ADTcallSite_obj_printName);
        cerr << "\n";
    }
    if (Set_empty(callSites)) return true;
    @@ for each call site
    Iterate(next, c, callSites,
        ADTimpFcn fi;
        Iterate(nextfi, fi, c->acs_afcn->adtaf_impl_fcns,
            if (mapsto(c->acs_sig, fi->afd_sig)) {
                Set_iterCleanup(c->acs_afcn->adtaf_impl_fcns, nextfi);
                goto SUCCESS;
            }
        );
        unassignImplType(callSiteVar);
        if (DebugAssign) {
            cout << "No implementations for " << c->acs_afcn << "\n";
        }
        Set_iterCleanup(callSites, next);
        return false;
      SUCCESS: ;
    );
    return true;
}
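
The check in implementable is a forward check: bind a representation to one variable, then confirm that every call site mentioning that variable still has at least one usable implementation, and undo the binding otherwise. The sketch below restates that idea in standard C++ with simplified, hypothetical types (`CallSite`, `Assignment`, `fits`, `tryAssign`); it leaves out the feasibility functions, alias sets, and dontCare handling of the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Repr = std::string;                        // representation name
using Assignment = std::map<std::string, Repr>;  // variable -> chosen repr

// Hypothetical, simplified call site: the actual variables passed and,
// for each candidate implementation, the representation of each formal.
struct CallSite {
    std::vector<std::string> actuals;
    std::vector<std::vector<Repr>> impls;
};

// An implementation fits if no already-assigned actual contradicts a formal.
bool fits(const CallSite& cs, const std::vector<Repr>& formals,
          const Assignment& a)
{
    if (formals.size() != cs.actuals.size()) return false;
    for (std::size_t i = 0; i < formals.size(); ++i) {
        auto it = a.find(cs.actuals[i]);
        if (it != a.end() && it->second != formals[i]) return false;
    }
    return true;
}

// Tentatively bind var to r. Keep the binding only if every call site that
// uses var still has at least one fitting implementation; otherwise undo it.
bool tryAssign(const std::string& var, const Repr& r,
               const std::vector<CallSite>& sites, Assignment& a)
{
    assert(a.find(var) == a.end());
    a[var] = r;
    for (const CallSite& cs : sites) {
        bool uses = false;
        for (const std::string& v : cs.actuals)
            if (v == var) uses = true;
        if (!uses) continue;
        bool ok = false;
        for (const std::vector<Repr>& formals : cs.impls)
            if (fits(cs, formals, a)) { ok = true; break; }
        if (!ok) { a.erase(var); return false; }  // undo, as unassignImplType does
    }
    return true;
}
```

As the original comment notes, this check is not required for correctness; it only rejects a binding early instead of discovering the dead end deeper in the search.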


DeclareUserFcn(parmsImplementable, ?, ?, ?, ?, changedp, Set)@;

boolean
parmsImplementable(ADTcallSite c, ADTimpFcn f,
                   DeclareParm(changedp, Set, VarDecl, vd_bminfo))
{
    @@ Check that the impl. fcn f can be used to implement the fcn
    @@ called at call site c by checking that the variables in the actual
    @@ parameter List to c can be assigned the
    @@ impl. types required by the formal parms of f. Note that the
    @@ parallel iteration over the formal parms of f and the
    @@ actual parms of c works by the
    @@ definition of the function findCompatibleImplementations.
    @@ Refinement: this check has to be performed for the actual var and every
    @@ variable to which it is aliased in every function call in which they
    @@ occur.

    assert(Set_empty(changedp));
    Iterator nextf;  @@ points to the implementation formal
    VarDecl impforml;
    Iterator nextv;  @@ points to the call site actual
    VarDecl actual;
    if (DebugAssign)
        cerr << ">>>parmsImplementable:" << c->acs_sig << f->afd_sig << "\n";
    List_iterInit(f->afd_sig.sig_sig, nextf);
    List_iterInit(c->acs_sig.sig_sig, nextv);
    while (List_iterate(f->afd_sig.sig_sig, nextf, impforml) &&
           List_iterate(c->acs_sig.sig_sig, nextv, actual))
    {
        assert(impforml->implemented); @@ it is a formal impln parm
        if (!actual->implemented) {
            // for each variable aliased to actual,
            //   for each call site using that variable,
            //     see if it is implementable using the type of impformal
            VarDecl aliasv;
            MALLOCK;
            if (DebugAssign) {
                cerr << ">>>Aliases for " << actual->name <<
                Set_print(actual->vd_as->as_set, cerr, VarDecl_obj_printForm);
                cerr << "\n";
            }
            forAll(aliasv, actual->vd_as->as_set,
                if (DebugAssign)
                    cerr << ">>>Alias loop: " << aliasv << "\n";
                if (!aliasv->implemented) {
                    if (implementable(aliasv, impforml->vd_repr)) {
                        // implementable assigns implementation to actual
                        if (DebugAssign) {
                            cout << "Implementing " << aliasv->name <<
                                " as " << impforml->vd_repr->name << "\n";
                        }
                        Set_add(changedp, aliasv);
                    }
                    else {
                        List_iterCleanup(f->afd_sig.sig_sig, nextf);
                        List_iterCleanup(c->acs_sig.sig_sig, nextv);
                        if (DebugAssign) {
                            cout << "Could not implement " << aliasv->name
                                 << " as " << impforml->vd_repr << "\n";
                        }
                        return false;
                    }
                }
            );
        }
    }
    List_iterCleanup(f->afd_sig.sig_sig, nextf);
    List_iterCleanup(c->acs_sig.sig_sig, nextv);
    return true;
}


DeclareUserFcn(undoImplementations, ivars, Set, ?, ?)@;

void
undoImplementations(DeclareParm(ivars, Set, VarDecl, vd_bminfo), int index)
{
    @@ undo the implementations of the variables in ivars
    VarDecl v;
    forAll(v, ivars,
        if (DebugAssign) {
            cout << " Undoing " << v->name << " " << index << "\n";
        }
        unassignImplType(v);
    );
}

@@ the call site of interest in the current invocation of assignable
@@ an unfortunately necessary global

ADTcallSite curCallSite;

int
compareCosts(ADTimpFcn f1, ADTimpFcn f2)
{
    @@ compare the resource costs of the two functions
    // these computations should be cached in the call site (e.g. a list
    // of costs for each possible afd_uid).
    double f1r, f2r;
    f1r = curCallSite->eval(f1);
    f2r = curCallSite->eval(f2);
    if (DebugCosts) {
        cerr << "Callsite(" << curCallSite->acs_upid << "): ";
        f1->typedName(cerr);
        cerr << "=" << f1r << " and ";
        f2->typedName(cerr);
        cerr << " = " << f2r << "\n";
    }
    if (f1r < f2r) return -1;
    if (f1r > f2r) return 1;
    return 0;
}

void
printSortedCallSites(boolean better)
{
    ADTcallSite cs;
    if (DebugSortCallSites) {
        cout << "\nSorted call sites:\n";
        forAll(cs, sortedCallSites,
            cout << cs << "\n";
            if (better)
                cs->betterImpl();
        );
        cout << "\n";
    }
}

void
RecordCurAssignments()
{
    VarDecl var;
    Token t;
    printSortedCallSites(true);
    forAll(`t,var', Vars,
        if (t != dontCareToken) {
            cout << "Implemented " << var->name
                 << " as " << var->vd_repr->name;
            if (var->vd_bestRepr != nil && var->vd_bestRepr != var->vd_repr) {
                cout << " (vs. " << var->vd_bestRepr->name << ")";
            }
            cout << "\n";
            var->betterRepr();
        }
    );
}

boolean
assignable(Iterator iter, int listIndex, double cost)
{
    @@ take the next call site c, and assign it the cheapest
    @@ implementation you can. whether c can be assigned the cheapest
    @@ implementation is determined by parmsImplementable.

    ADTcallSite c;
    ADTimpFcn fi;   @@ a candidate implementation of this abs. fcn
    DECLARE(implSet, Set, ADTimpFcn, afd_bminfo); @@ Set of impl'n fcns
                                                  @@ compatible with ADTcallSite c
    DECLARE(implList, List, ADTimpFcn); @@ implSet sorted;
    boolean worked; @@ true if parmsImplementable succeeded;
    assert(cost >= 0);

    @@ prune this search branch if we are already too costly;
    if (curAssignCost != -1.0 && cost >= curAssignCost) {
        cout << ".Pruned at " << listIndex << ".\n";
        return false;
    }

cout  «  «  listindex; 

    if (List_iterDone(sortedCallSites, iter)) {
        // then we are at the bottom of the file.
        if (curAssignCost == -1.0 || cost < curAssignCost) {
            RecordCurAssignments();
            if (curAssignCost == -1.0) cout << "First ";
            else cout << "Better ";
            cout << "implementation: " << form("%10.2f", cost);
            if (curAssignCost != -1.0) {
                double delta = (curAssignCost - cost) / curAssignCost;
                cout << " delta=" << form("%10.8f", delta);
            }
            cout << "\n";
            curAssignCost = cost;
        }
        else {
            cout << "Not better implementation: " << cost << "\n";
        }
        return true;
    }
    if (!List_iterate(sortedCallSites, iter, c)) {
        assert(false);
    }
    if (DebugAssign) {
        cerr << ">>>LOOP: assignable call site: " << c << "\n";
    }
    MALLOCK;
    CallUserFcn(findCompatibleImplementations, c, implSet)@;
    findCompatibleImplementations(c, implSet);
    if (DebugAssign) {
        cerr << ">>>Compatible implementations: ";
        Set_print(implSet, cerr, ADTimpFcn_obj_print);
        cerr << "\n";
    }

    curCallSite = c;
    Set_sort2(implSet, implList, &compareCosts);

    @@ for each implementation fi for c, assign the types implied by
    @@ fi's Signature to the variables for c.
    @@ if this assignment is feasible, then recurse and try the next
    @@ call site on the List. if the recursion returns true, then
    @@ return true, else the assignment is not feasible.
    @@ if the assignment is not feasible try the next implementation fcn.
    @@ if no impl fcns are feasible then return false.

    Iterate(next, fi, implList,
        @@ the Set of all variables assigned
        DECLARE(changed, Set, VarDecl, vd_bminfo);
        @@ on a call to parmsImplementable.
        if (DebugAssign) {
            cerr << ">>>assignable loop on " << fi << " index " <<
                listIndex << "\n";
        }
        CallUserFcn(parmsImplementable, c, fi, changed);
        worked = parmsImplementable(c, fi, changed);
        if (DebugAssign) {
            cerr << ">>>parmsImplementable: " << BOOL(worked) << "\n";
        }
        if (worked) {
            Iterator iter2;
            List_iterCopy(sortedCallSites, iter, iter2);
            c->implement(fi);
            double fcnCost = c->eval();
            if (isnan(fcnCost) || isinf(fcnCost) || fcnCost < 0.0) {
                cerr << "Evaluation function problem:\n";
                cerr << "The fcn for " << fi->repr->name << fi->name << "\n";
                cerr << "fubarred. It returned " << fcnCost << "\n";
                exit(1);
            }
            if (assignable(iter2, listIndex+1, cost + c->eval())) {
                // we're walking back up the list toward the more
                // important call sites;
                if (listIndex < cutOffIndex) {
                    // then we want to try alternatives;
                    List_iterCleanup(sortedCallSites, iter2);
                    CallUserFcn(undoImplementations, changed, listIndex);
                    undoImplementations(changed, listIndex);
                    c->unimplement();
                    continue; @@ very important: continues the forAll!
                }
                else {
                    // then we won't try alternatives;
                    List_iterCleanup(sortedCallSites, iter2);
                    List_iterCleanup(implList, next);
                    if (DebugAssign || DebugCosts) {
                        cerr << ">>>assigned " << c << " " << fi->repr->name
                             << " cost: " << c->eval(fi) << "\n";
                    }
                    return true;
                }
            }
            else {
                CallUserFcn(undoImplementations, changed, listIndex);
                undoImplementations(changed, listIndex);
                c->unimplement();
            }
            List_iterCleanup(sortedCallSites, iter2);
        }
        else {
            if (DebugAssign) {
                cout << "Did not implement call site " << c << "\n";
            }
            CallUserFcn(undoImplementations, changed, listIndex);
            undoImplementations(changed, listIndex);
            c->unimplement();
        }
    );
    return false;
}
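
The search in assignable is a depth-first branch and bound over the sorted call sites: prune when the running cost reaches the best complete cost found so far, and reconsider alternatives only for call sites whose index is below the cutoff. The following sketch shows that control flow in standard C++ under simplifying assumptions. The types `Candidate`, `Site`, and `Search` are hypothetical, each site uses a single variable, and only the best cost is recorded, whereas the listing keeps the variable assignments themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model: each call site uses one variable, and each
// candidate implementation demands a representation for it at some cost.
struct Candidate { int repr; double cost; };
struct Site { int var; std::vector<Candidate> impls; };  // impls sorted by cost

struct Search {
    std::vector<Site> sites;      // most important call site first
    std::vector<int> reprOf;      // per variable: chosen repr, or -1 if unset
    std::size_t cutOffIndex = 0;  // sites below this index try alternatives
    double best = -1.0;           // -1 means no complete assignment yet

    // Returns true if a complete assignment was accepted below this node.
    bool assign(std::size_t index, double cost)
    {
        if (best != -1.0 && cost >= best) return false;   // prune this branch
        if (index == sites.size()) { best = cost; return true; }
        const Site& s = sites[index];
        for (const Candidate& c : s.impls) {
            const int prev = reprOf[s.var];
            if (prev != -1 && prev != c.repr) continue;   // conflicts: infeasible
            reprOf[s.var] = c.repr;
            const bool ok = assign(index + 1, cost + c.cost);
            reprOf[s.var] = prev;                         // undo the binding
            // At or past the cutoff the first success is final; before it,
            // keep looping so cheaper complete assignments can be found.
            if (ok && index >= cutOffIndex) return true;
        }
        return false;
    }
};
```

With `cutOffIndex` equal to 0 the sketch behaves greedily and stops at the first complete assignment. With a cutoff covering every site, the top-level call returns false after exhausting the alternatives and the result is read from `best`, which mirrors how main ignores the return value of assignable and tests curAssignCost instead.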


void
findVariableAliases()
{
    // for each call site
    //   alias the formal and the actual
    // until done
    //   for each variable
    //     merge alias sets.
}

void
assignEverythingToProfile()
{
    Token t;
    VarDecl vd;
    forAll(`t, vd', Vars,
        vd->vd_repr = vd->vd_bestRepr = vd->vd_ADT->adt_profileImpl;
        vd->implemented = true;
    );
}

int
compareCallSitesImportance(ADTcallSite c1, ADTcallSite c2)
{
    if (c1->acs_rank == IllegalRank)
        c1->acs_rank = (*evalFcns[c1->acs_afcn->evalFcn->afd_uid])(c1);
    if (c2->acs_rank == IllegalRank)
        c2->acs_rank = (*evalFcns[c2->acs_afcn->evalFcn->afd_uid])(c2);
    if (c1->acs_rank < c2->acs_rank) return 1;
    if (c1->acs_rank > c2->acs_rank) return -1;
    return 0;
}

void
sortByImportance(void)
{
    @@ sorts the Set ADTcalls (the Set of callSites) into the List
    @@ sortedCallSites. the key is the frequency of the call sites.
    Set_toList(ADTcalls, sortedCallSites);
    List_sort1(sortedCallSites, &compareCallSitesImportance);
    @@ better:
    @@ Set_sort(ADTcalls, sortedCallSites, &compareCallSitesImportance)
    @@ now determine how many of these items we're going to iterate over;
    int size = List_length(sortedCallSites);
    if (cutOffPercent == 100) {
        cutOffIndex = size;
    }
    else if (cutOffPercent == 0) {
        cutOffIndex = 0;
    }
    else {
        double rankSum = 0;
        ADTcallSite cs;
        forAll(cs, sortedCallSites,
            assert(cs->acs_rank != IllegalRank);
            rankSum += cs->acs_rank;
        );
        double rankCutOff = rankSum * cutOffPercent / 100;
        rankSum = 0;
        cutOffIndex = 0;
        forAll(cs, sortedCallSites,
            if (rankCutOff <= rankSum) break;
            rankSum += cs->acs_rank;
            cutOffIndex++;
        );
    }
    printSortedCallSites(false);
}
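
sortByImportance turns the percentage cutoff into an index: it sums all ranks, takes the requested percentage of that total, and counts call sites from the most important downward until the running sum reaches the target. A standalone restatement in standard C++ follows; the function name `cutOffIndexFor` and the use of a plain vector of ranks are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone version of the cutoff computation.
// ranks must already be sorted with the most important call site first.
std::size_t cutOffIndexFor(const std::vector<double>& ranks, int cutOffPercent)
{
    assert(0 <= cutOffPercent && cutOffPercent <= 100);
    if (cutOffPercent == 100) return ranks.size();
    if (cutOffPercent == 0) return 0;

    double total = 0;
    for (double r : ranks) total += r;
    const double target = total * cutOffPercent / 100;

    // Count sites until their cumulative rank covers the target share.
    double sum = 0;
    std::size_t index = 0;
    for (double r : ranks) {
        if (target <= sum) break;
        sum += r;
        ++index;
    }
    return index;
}
```

For ranks 50, 30, 15, 5, a cutoff of 50 percent selects only the first call site, while 60 percent selects the first two, because the first site alone covers exactly half of the total rank.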

void
printAssignments(char *foname)
{
    // print out the assignments
    openFile(afile, o, "w", foname, true);
    Token t;
    VarDecl vd;
    afile << m5Comment << "This is automatically created by therblig\n";
    afile << m5Comment << "\n";

    afile << m5PushPool_VarDecl_pool << "VARDECLS_HDR\n";
    afile << m5Comment2;
    forAll(`t, vd', Vars,
        if (vd->vd_bestRepr == nil) {
            afile << "!!!Not implemented: " << vd->name << "\n";
            cerr << "!!!Not implemented: " << vd->name << "\n";
        }
        else {
            @@ first, call the instantiation routine for the ADT
            @@ on this variable. It will define the strings necessary
            @@ to declare the variable, make sure the sources of the code
            @@ exist, and that the appropriate coercion class exists.;
            (*instantiationFcns[vd->vd_bestRepr->adtr_number])(vd);
            afile << m5Comment2 << m5Comment << " " << vd->name << "\n"
                  << m5Comment2;
            @@;
            @@ first, put out the INSTANTIATE_ADT_i macro;
            @@;

afile  «  "INSTANTIATEC  «  vd->vd_bestRepr->name  « 

«  vd->instance_name ; 
if  ( ! vd->instance_parm->empty() ) 

afile  «  «  vd->instance_parm->string() ; 

afile  «  ")"  «  m5Comment2; 

@@  second  put  out  the  COERCE_ADT_i  macro; 
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1770  ) 

1771  afile  « 

1772  > 

1773 

1774 


afile  <<  "CQERCEC  «  vd->vd_bestRepr->name  « 

«  vd->instance_name 

«  «  vd->coercion_name->string() ; 

if  ( !  vd->coercion_parni->empty  0 ) 

afile  «  «  vd->coercion_parm->string() ; 

afile  <<  ")"  «  m5Comment2; 

@0; 

00  third  put  out  the  DECLARE_M  macro  for  the  variable  itself. 

afile  <<  m5DECLARE_M  «  vd->name  « 

«  vd->coercion_name->string() ; 
if  ( ! vd->constructor_parms->empty 0 ) 

afile  «  «  vd->constructor_parms->string() ; 

afile  <<  «  m5Comment2; 

afile  <<  m5Comment2; 

@@  Token  t; 

@@  forAlKt,  vd->vd_adtParms ,  api,  afile  «  «  t;  ); 

> 

"VARDECLS_TLR\n"  ; 


void
initialize()
{
    dontCareADT = isADType(dontCareToken);
    dontCareRepr = isADTRepr(dontCareToken);
    Map_define(dontCareADT->adt_reprs, dontCareToken, dontCareRepr);

    dontCareADT->adt_inited = true;
    dontCareADT->adt_number = 0x7FFFFFFF;

    dontCareRepr->adtr_of = dontCareADT;
    dontCareRepr->adtr_inited = true;
    dontCareRepr->adtr_number = 0x7FFFFFFF;
}

#define ReadThisForAllFiles(varname,fcn) \
    fnp = argv; do { \
        Token t_f; \
        strcpy(fbuf, *fnp); \
        strcat(fbuf, "_"); \
        strcat(fbuf, varname); \
        t_f = new Token_obj(*fnp, id_tkn, 0); \
        if (DebugFiles) { \
            cerr<<"Doing function "<<#fcn<<" on file "<<fbuf<<"\n"; \
        } \
        fcn(t_f,fbuf); } while (*++fnp != 0)

main(int argc, char **argv)
{
    Iterator scs_iter;
    char **fnp;
    char fbuf[64];
    char *vardeclsName = "vardecls.m5";
    curAssignCost = -1.0;
    argv++; argc--;

    cutOffPercent = 0; @@ this should yield the same results as the original
                       @@ version;
    while (argv[0] != 0 && argv[0][0] == ch_minus) {
        char *cp = &argv[0][1];
        if (*cp == ch_D) {
            while (*++cp != 0) {
                if (*cp == ch_a) { DebugAssign = true; }
                else if (*cp == ch_c) { DebugCosts = true; }
                else if (*cp == ch_f) { DebugFiles = true; }
                else if (*cp == ch_i) { DebugInput = true; }
                else if (*cp == ch_s) { DebugSortCallSites = true; }
                else if (*cp == ch_d) { DebugDetails = true; }
#ifdef DBG_MALLOC
                else if (*cp == ch_m) { DebugMalloc = true; malloc_debug(1); }
                else if (*cp == ch_M) { DebugMalloc = true; malloc_debug(2); }
#endif
                else fatal("Unknown debugging flag");
            }
        }
        else if (*cp == ch_o) { vardeclsName = argv[1]; argv++; }
        else if (*cp == ch_P) {
            if (*(cp+1) != ch_null) {
                cutOffPercent = atoi(cp+1);
            }
            else {
                cutOffPercent = atoi(argv[1]); argv++;
            }
            assert(0 <= cutOffPercent && cutOffPercent <= 100);
            if (cutOffPercent < 0) cutOffPercent = 0;
            if (cutOffPercent > 100) cutOffPercent = 100;
        }
        else fatal("Unknown argument");
        argv++;
        argc--;
    }
    if (argc < 1) fatal("Must specify file to work on");
    initialize();
    readADTs("ADTs.th"); @@ this should really be compiled in and not read.
    profileDataValid = true;
    ReadThisForAllFiles("profData.dat", readProfileData);
    ReadThisForAllFiles("`ADT_vars'.th", readVarDecls);
    ReadThisForAllFiles("`ADT_ufcns'.th", readUserFcnDecls);
    ReadThisForAllFiles("`ADT_csites'.th", readADTcallSites);
    ReadThisForAllFiles("`ADT_ufcalls'.th", readUserFcnCalls);
    findVariableAliases();
    if (profileDataValid) {
        cout << "Attempting assignment ...\n";
        sortByImportance();
        cout << "Cutoff: " << cutOffPercent << " results in "
             << cutOffIndex << " of " << List_length(sortedCallSites)
             << " being recursed over.\n";
        List_iterInit(sortedCallSites, scs_iter);
        assignable(scs_iter, 0, 0.0); @@ always returns false
        if (curAssignCost >= 0.0) {
            cout << "Writing " << vardeclsName << ".\n";
            printAssignments(vardeclsName);
        }
        else {
            cout << "\nCannot assign for some reason: assigning default profiling implementations\n";
            assignEverythingToProfile();
            cout << "Writing " << vardeclsName << ".\n";
            printAssignments(vardeclsName);
            exit(1);
        }
        List_iterCleanup(sortedCallSites, scs_iter);
    }
    else {
        cout <<
            "No profile data: assigning default profiling implementations\n";
        assignEverythingToProfile();
        printAssignments(vardeclsName);
    }
    /* */ MALLOCK;
    if (DebugAssign || DebugCosts) {
        cerr<<"Nof VarDecls = "<< VarDecl_obj::vd_R.last + 1 <<"\n";
        cerr<<"Nof ADTabsFcns = "<< ADTabsFcn_obj::adtaf_R.last + 1 <<"\n";
        cerr<<"Nof ADTimpFcns = "<< ADTimpFcn_obj::afd_R.last + 1 <<"\n";
        cerr<<"Nof ADTcallSites = "<< ADTcallSite_obj::acs_R.last + 1 <<"\n";
    }
    exit(0);
}

1  00  FILE:  Tokens. t 

2  #define  TQKENS_MAIN 

3  #include  <stream.h> 

4  #include  <ctype.h> 

5  #include  "util.H" 

6  #include  "Tokens. H" 

7  #include  "charClasses .h" 

8  #include  "Tokens_ADTs .H" 

9  #include  "userTypes .H" 

10 

11  String_obj : : ~String_obj 0 

12  { 

13  delete  [maxlen] str ; 

14  > 

15 

16  void 

17  String_obj :: realloc (int  add) 

18  { 

19  //  assumes  that  len  is  already  set  to  new  length 

20  //  if  add  ==  0,  then  this  copies  the  string. 

21  char  *cp  =  new  char  [len]; 

22  strcpyCcp,  str); 

23  delete  [len-add] str ; 

24  str  =  cp; 

25  maxlen  =  len; 

26  > 

27 

28  String_obj& 

29  String_obj : : operator« (String_obj&  t) 

30  { 

31  if  (maxlen  <=  (len  +=  t.len))  { 

32  realloc (t . len) ; 

33  > 

34  strcat(str,  t.str); 

35  return  *this; 

36  > 

37 

38  String_obj& 

39  String_obj : : operator« (char  *cp) 


173 


40  { 

41  int  1  =  strlen(cp) ; 

42  if  (maxlen  <=  (len  +=  1))  ■{ 

43  realloc (1) ; 

44  > 

45  strcat(str,  cp) ; 

46  return  *this; 

47  > 

48 

String_obj&
String_obj::operator<<(int i)
{
    char buf[32]; // should be big enough for an int
    sprintf(buf, "%d", i);
    int l = strlen(buf);
    if ((len += l) >= maxlen) {
        realloc(l);
    }
    strcat(str, buf);
    return *this;
}

String_obj&
String_obj::operator<<(Token_obj& T)
{
    if ((len += T.len) >= maxlen) {
        realloc(T.len);
    }
    strcat(str, T.str);
    return *this;
}

#define HASH_SZ MAX_NOF_TOKENS

#define NOF_SECHASH 16
static Token hash[HASH_SZ];
static int hashprime[NOF_SECHASH] = {
    13,31,41,71,131,139,149,157,163,173,227,373,389,457,461,499
};

@@ Tokens consist of a type (id_tkn, punct_tkn, num_tkn, ...) and a
@@ string.

Token_obj::Token_obj(char *sp, typetype t, int v)
{
    int hashv;
    int i;
    int l = strlen(sp);
    int first, phash, shash;
    char *cp = sp;
    Token T;
    bool done;
    phash = 0;
    for (i=0; i < l; i++) {
        phash = (phash << 1) + *cp++;
    }
    phash += (int)t + (phash >> 8);
    first = phash & (HASH_SZ-1);
    shash = hashprime[(phash >> 10) & (NOF_SECHASH - 1)];
    done = false;
    do {
        if ((T=hash[first]) == 0) {
            @@ not in table;
            if (this == 0) {
                T = hash[first] = (Token) new char[sizeof(Token_obj)];
            }
            else T = hash[first] = this;
            T->str = new char[l+1];
            strcpy(T->str, sp);
            T->hashv = phash;
            T->val = v;
            T->len = l;
            T->type = t;
            T->uid = Token_nextuid++;
            done = true;
        }
        else {
            if (T->hashv == phash && T->len == l && strcmp(T->str, sp) == 0)
                @@ found it in the table
                done = true;
            else {
                first = (first + shash) & (HASH_SZ-1);
                if (first == (phash & (HASH_SZ-1))) {
                    cerr << "Token table full\n";
                    exit(1);
                }
            }
        }
    } while (!done);
    this = T;
}
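
The constructor above interns each token in an open-addressed table: a primary hash picks the first slot, and a second value taken from the `hashprime` table gives the probe step. The following is a minimal modern C++ sketch of that lookup-or-insert scheme, not the TYPESETTER implementation. The names `InternTable`, `Entry`, `kSize`, and `kStep` are hypothetical, the table size of 1024 is an assumption (the listing only implies a power of two through its use of `& (HASH_SZ-1)`), and `std::string` replaces the manual `new`/`strcpy` storage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace {
// Assumed table size; must be a power of two so that masking replaces modulo.
const std::size_t kSize = 1024;
// Probe steps, copied from the hashprime table in the listing. Each step is
// odd, so it is coprime to kSize and the probe sequence visits every slot.
const unsigned kStep[16] = {13, 31, 41, 71, 131, 139, 149, 157,
                            163, 173, 227, 373, 389, 457, 461, 499};
}

struct Entry {
    bool used;
    unsigned hash;
    std::string text;
    int uid;
};

class InternTable {
public:
    InternTable() : slots_(kSize), next_uid_(0) {}

    // Returns the uid of the (possibly new) entry, or -1 when the probe
    // sequence comes back to its first slot without finding room.
    int intern(const std::string& s, int type) {
        const unsigned h = primary_hash(s, type);
        const std::size_t start = h & (kSize - 1);
        const std::size_t step = kStep[(h >> 10) & 15];
        std::size_t i = start;
        do {
            Entry& e = slots_[i];
            if (!e.used) {                       // free slot: insert here
                e.used = true;
                e.hash = h;
                e.text = s;
                e.uid = next_uid_++;
                return e.uid;
            }
            if (e.hash == h && e.text == s) {    // already interned
                return e.uid;
            }
            i = (i + step) & (kSize - 1);        // secondary-hash probe
        } while (i != start);
        return -1;                               // table full
    }

private:
    // Shift-and-add over the characters, then fold in the token type,
    // following the arithmetic in the listing (unsigned to avoid overflow UB).
    static unsigned primary_hash(const std::string& s, int type) {
        unsigned h = 0;
        for (std::size_t k = 0; k < s.size(); ++k) {
            h = (h << 1) + static_cast<unsigned char>(s[k]);
        }
        return h + static_cast<unsigned>(type) + (h >> 8);
    }

    std::vector<Entry> slots_;
    int next_uid_;
};
```

Interning the same string and type twice yields the same uid, which is what lets the rest of the system compare tokens by identity rather than by string contents. Because the type is folded into the hash, the same spelling with a different type is stored as a separate entry in this sketch.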

Token Token_obj::suffix()
{
    // return the suffix of the string of the form: xxxx_ss
    char *cp = strrchr(str, ch__);
    if (cp == nil) return new Token_obj , id_tkn, 0);
    return new Token_obj(cp+1, id_tkn, 0);
}

Token Token_obj::append(char *sfx)
{
    char buffer[256];
    strcpy(buffer, str);
    strcat(buffer, sfx);
    return new Token_obj(buffer, type, 0);
}

ostream&
operator<<(ostream &s, String str)
{
    if (str == 0) { return s << "<string not allocated!>"; }
    if (str->str == 0) { return s<<"<char array not allocated for string!>"; }
    else return s << str->str;
}

ostream&
operator<<(ostream &s, Token gt)
{
    gt->print(s);
    return s;
}

static char buf[256];

#define casech(ch,tkn) case ch:{*cp++ = c; s.get(c); toktype = tkn; break;}

istream&
operator>>(istream& s, Token& gt)
{
    char c;
    char *cp;
    typetype toktype;
    int tokval;
    int toklen;
    cp = buf;
LOOP:
    if (!s.good()) {
        if (s.eof())
            cerr << "Tokens>> should never reach EOF in any input files\n";
        if (s.fail()) cerr << "Token >> => _fail?\n";
        if (s.rdstate() == _bad) cerr << "Token >> => _bad?\n";
        cerr << "Token >> not good!\n";
        exit(1);
    }
    s.get(c);
    switch (c) {

    case_is_C_firstid:
        {
            *cp++ = c;
            s.get(c);
            while (s.rdstate() != _eof && (isalnum(c) || c == ch__)) {
                *cp++ = c;
                s.get(c);
            }
            toktype = id_tkn;
            break;
        }

    case ch_twiddle:
        c = ^ O
        /* NOTE fall through! */

    case_isspace:
        goto LOOP;

    case_isdigit:
        {
            tokval = c - ch_0;
            *cp++ = c;
            s.get(c);
            if (c == ch_x || c == ch_X) { @@ hex numbers
                *cp++ = ch_x;
                s.get(c);
                while (s.rdstate() != _eof && (isxdigit(c))) {
                    int h = c - ch_0;
                    if (h > 9) h = c - ch_A + 10;
                    if (h > 15) h = c - ch_a + 10;
                    tokval = tokval*16 + h;
                    *cp++ = c;
                    s.get(c);
                }
            }
            else if (isdigit(c)) {
                tokval = tokval * 10 + c - ch_0;
                *cp++ = c;
                s.get(c);
                while (s.rdstate() != _eof && isdigit(c)) {
                    tokval = tokval * 10 + c - ch_0;
                    *cp++ = c;
                    s.get(c);
                }
            }
            toktype = num_tkn;
            break;
        }
    casech(ch_dot,punct_tkn);
    casech(ch_colon,punct_tkn);
    casech(ch_semi,punct_tkn);
    casech(ch_quest,punct_tkn);
    casech(ch_comma,punct_tkn);
    casech(ch_eq,punct_tkn);
    casech(ch_minus,punct_tkn);
    casech(ch_plus,punct_tkn);

    default:
        *cp++ = c;
        s.get(c);
        toktype = unknown_tkn;
        error("Unknown token");
        *cp = ch_null;
        error(buf);
    }

    if (s.rdstate() != _eof) s.putback(
    *cp = ch_null;
    toklen = cp - buf;
    gt = new Token_obj(buf, toktype, to!
    if (DebugInput) {
        cerr << buf << " ";
        if (gt == commaToken || gt == S'
    }
    return s;
}
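
The `case_isdigit` branch of the scanner above accumulates a number's value while copying its characters into the token buffer, switching to base 16 when the second character is `x` or `X`. The following is a minimal sketch of that branch as a standalone function in modern C++, not the author's code. The names `NumberToken` and `scan_number` are hypothetical, and `peek()` is used in place of the listing's read-ahead followed by `putback`, so the character after the number is simply left in the stream.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type: the literal's spelling and its numeric value.
struct NumberToken {
    std::string text;
    int value;
};

// Reads one unsigned literal, decimal or hex with an x/X after the first
// digit. Assumes the caller has checked that the next character is a digit.
NumberToken scan_number(std::istream& in) {
    const int kEof = std::char_traits<char>::eof();
    NumberToken tok;
    int c = in.get();
    tok.value = c - '0';
    tok.text += static_cast<char>(c);

    c = in.peek();
    if (c == 'x' || c == 'X') {
        in.get();
        tok.text += 'x';  // the listing also stores a lowercase x
        while ((c = in.peek()) != kEof && std::isxdigit(c)) {
            in.get();
            const int h = std::isdigit(c) ? c - '0'
                                          : std::tolower(c) - 'a' + 10;
            tok.value = tok.value * 16 + h;
            tok.text += static_cast<char>(c);
        }
    } else {
        while ((c = in.peek()) != kEof && std::isdigit(c)) {
            in.get();
            tok.value = tok.value * 10 + (c - '0');
            tok.text += static_cast<char>(c);
        }
    }
    return tok;
}
```

For example, scanning the input `0x1F;` with this sketch yields the text `0x1F` and the value 31, and leaves `;` as the next character to be read. Like the listing, it accepts an `x` after any leading digit, not only after `0`.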


